
1 Introduction

Combinatorial optimization problems (COPs) are a special class of problems in operations research, in which a real-valued function is optimized over a given domain (the feasible set) and some or all of the variables are discrete. The main criterion for any proposed solution method for COPs is that it should outperform complete enumeration. The conventional methods used to solve COPs include branching, bounding, cutting planes, etc. COPs are usually NP-Hard, so the solution time of the conventional methods grows exponentially with the problem size. The desire to reduce the time needed to find an optimal solution of a COP directed research toward unconventional methods. The primary unconventional methods for solving COPs are heuristics, the most prominent of which are the metaheuristics, such as simulated annealing, genetic and evolutionary algorithms, tabu search, and the greedy randomized adaptive search procedure [15, 38, 44, 46, 53, 87, 94, 98, 115]. In parallel, there was research [32, 112] on using analog circuits instead of conventional digital computers for solving COPs. A method for solving linear and quadratic programs using resistors, diodes, transformers, current sources, and voltage sources (the basic building blocks of analog circuits) was proposed in [32]. However, this method was impractical due to the unavailability of an ideal diode, and it remained theoretical until improved methods were proposed in [23, 62]. The first known analog circuit implementations capable of solving linear and quadratic programming problems were presented in [24, 131], although these were not ideal because they required negative resistors [131]. Later, in 1985, Hopfield and Tank [121] proposed an analog circuit consisting of neurons, called an artificial neural network (ANN), to solve COPs. This method caught the attention of many researchers when Hopfield and Tank [121] solved some sufficiently large instances of the symmetric traveling salesman problem (TSP). The proposed method illustrated an alternative approach to solving COPs that is independent of digital computers. After this illustration, many researchers sought to utilize ANNs as a tool to solve various COPs.

1.1 Objective

The main objective of this chapter is to present different ways of applying ANNs to solve COPs. Furthermore, the issues related to the optimality and/or stability analysis of various ANNs will be dealt with, and the implementation and mapping of ANNs for solving COPs will be illustrated by examples. A brief history of ANNs will also be reviewed, and the use of ANNs for solving general optimization problems will be addressed.

1.2 Outline

In this chapter, ANN models that are applied to solving COPs are described. The chapter is organized as follows: an introduction to ANNs in general, and a review of the classical ANN models suitable for solving COPs in particular, are presented in Sect. 2. A systematic treatment of the stability and convergence analysis of ANNs is given in Sect. 3. Extensions and improvements that can be incorporated into the classical ANN models are presented in Sect. 4. A methodology for mapping and solving some general optimization problems is discussed in Sect. 5, and applications of ANNs for solving various discrete programming problems are surveyed in Sect. 6. Criticism of using ANNs to solve COPs is presented in Sect. 7. Finally, the chapter ends with concluding remarks in Sect. 8.

2 Review

The human brain and its functionality have been a topic of research for many years, and to date there is no model that can aptly imitate it. The curiosity of studying the human brain led to the development of ANNs. Hence, ANNs are mathematical models mimicking the interactions among the neurons of the human brain. Initially, ANNs were developed to provide a novel approach to problems that lack an algorithmic solution method. The first model depicting a working ANN used the concept of the perceptron [82, 101]. After the invention of the perceptron, ANNs were an area of interest for many researchers for almost 10 years. However, the field stagnated following the results of Minsky and Papert [85], which showed that the perceptron is incapable of modeling even simple logical operations; their work, however, also indirectly suggested the use of multilayer ANNs. After a decade, the field regained popularity with Hopfield's feedback ANN [56, 57]. A pioneering practical approach to solving COPs was presented in [58, 121]. The mathematical analysis given by Hopfield demonstrated that ANNs can act as a gradient descent method for solving optimization problems. Moreover, some instances of the traveling salesman problem (TSP) were mapped and solved using ANNs. This approach kindled enthusiasm among many researchers aiming to provide an efficient solution methodology for NP-Hard problems. Since then, ANNs have been modified and applied in numerous ways to solve COPs.

2.1 Artificial Neural Networks (ANNs)

Neurons are the primary elements of any ANN. They possess some memory represented by their state. The state of a neuron is represented either by a single state or by a dual state (an internal and an external state), and it can take any real value in the interval [0, 1]. Furthermore, neurons interact with each other via links to share the available information. In general, the external state of a neuron takes part in the outer interactions, whereas the internal state takes part in the firing decision. The intensity and sense of interaction between any two connected neurons are represented by the weight (or synaptic weight) of the link.

A single neuron by itself is incapable of processing any complex information; it is the collective behavior of the neurons that results in intelligent information processing. Usually, neurons update their states based on the weighted cumulative effect of the states of the surrounding neurons. The update rule for the states is

$$\displaystyle\begin{array}{rcl} \mathcal{I}_{i} =\displaystyle\sum _{ j=1,j\neq i}^{n}w_{ ij}x_{j} - U_{i}& &\end{array}$$
(1)
$$\displaystyle\begin{array}{rcl} x_{i} = \Psi \left (\mathcal{I}_{i}\right )& &\end{array}$$
(2)

where $x_{i}$ and \(\mathcal{I}_{i}\) represent the external and internal states of the \(i\mathrm{th}\) neuron, respectively, $w_{ij}$ represents the weight between the \(i\mathrm{th}\) and the \(j\mathrm{th}\) neuron, and $U_{i}$ represents the threshold level of the \(i\mathrm{th}\) neuron. The function \(\Psi ()\) is called the transfer function. If the collective input to the \(i\mathrm{th}\) neuron \(\left (\sum _{j=1,j\neq i}^{n}w_{ij}x_{j}\right )\) is greater than the threshold $U_{i}$, then the neuron fires (the value of $x_{i}$ is set to 1). This is the basic mechanism of any ANN. With an appropriate network structure and link weights, ANNs can perform any logical operation.

Example 1

Consider the following neural networks:

In Fig. 1, a simple neural network that performs the logical AND and OR operations is depicted. The number over a link represents the weight of the link, and the number inside a neuron represents its threshold value. Each network has two binary input signals and one binary output signal. Similarly, in Fig. 2, another elementary logical operator, XOR (the exclusive OR), is realized in two different ways; this shows that a neural network implementing a specific operation need not have a unique representation. A small sketch of such threshold-gate networks is given after the figure captions below.

Fig. 1 Simple logical operations in ANNs: (a) OR, (b) AND

Fig. 2 Simple logical operations in ANNs: (a) XOR, (b) XOR
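The exact weights and thresholds drawn in Figs. 1 and 2 are not reproduced here, so the following minimal Python sketch uses one standard choice of weights and thresholds; only the firing rule of Eqs. (1) and (2) is taken from the text, and the two-layer XOR construction is one of several possible realizations.

```python
import numpy as np

def fire(inputs, weights, threshold):
    """McCulloch-Pitts style neuron: output 1 if the weighted input
    sum exceeds the threshold, else 0 (cf. Eqs. (1) and (2))."""
    return int(np.dot(weights, inputs) > threshold)

def OR(x1, x2):   # single neuron, weights (1, 1), threshold 0.5
    return fire([x1, x2], [1, 1], 0.5)

def AND(x1, x2):  # single neuron, weights (1, 1), threshold 1.5
    return fire([x1, x2], [1, 1], 1.5)

def XOR(x1, x2):  # two layers: (x1 OR x2) AND NOT (x1 AND x2)
    return fire([OR(x1, x2), AND(x1, x2)], [1, -2], 0.5)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, "->", OR(a, b), AND(a, b), XOR(a, b))
```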

From Figs. 1 and 2, it can be seen that although a single neuron possesses little computational ability, collectively neurons can easily perform any logical operation. Moreover, these are just elementary logical operations; ANNs can be used for complex logical operations as well, for which a feedback mechanism is used. Based on the feed mechanism, ANNs can be classified as feed-forward neural networks and feedback neural networks, as shown in Fig. 3.

Fig. 3 Major categories of ANNs: (a) feed-forward ANN, (b) feedback ANN

In feed-forward neural networks, information is processed in only one direction, whereas in feedback neural networks, information is processed in both directions. From the literature, there are typically three main types of applications of ANNs in information processing. They are categorized as:

  • Type 1:  Pattern Recognition

    • This is the most explored application of ANNs [14, 19, 88, 89]. In pattern recognition, two data sets, namely training data and testing data, are given. Generally, the data sets consist of input parameters that are believed to be the cause of the output features. The aim of pattern recognition is to find a relation that maps the input parameters to the output features. This is obtained by learning, where the weights of the neural network are updated by error backpropagation. When the error between the generated features and the required output features falls below some acceptable limit, the weights are frozen, i.e., set to their current values. These weights are then used to validate the performance of the obtained neural network on the testing data. This approach is used in function approximation, data prediction, curve fitting, etc.

  • Type 2:  Associative Memory

    • Advancement in the theory of ANNs from feed-forward to feedback neural networks led to this application [18, 84, 93]. In an associative memory application, a neural network is set to known target memory patterns. Later, if an incomplete or partly erroneous pattern is presented to the trained ANN, the ANN is capable of completing or repairing the input pattern into the most closely resembling target pattern.

      This approach is used in handwriting recognition, image processing, etc.

  • Type 3:  Optimization

    • Further extensions of the associative memory approach led to the application of ANNs in solving optimization problems. This application is fundamentally different from pattern recognition. In this approach, the weights of the neural network are fixed, and the neuron states are updated (synchronously or asynchronously) such that the total energy of the network is minimized. This minimization is achieved if the network is stable. Depending on the optimality conditions satisfied, the system may converge to a local (or global) optimal solution. ANNs have been applied not only to COPs [3, 43, 64, 97, 110, 111] but also to optimization problems in general [39, 86, 90, 104].

In the following subsection, an example implementation of ANNs for solving a general linear programming (LP) problem is presented.

2.2 Example: Implementation

One of the conventional methods for solving any COP is to relax the problem into a series of continuous problems and solve them until the desired discrete solution is obtained [47]. Although the simplex method is the most widely used method for solving the linear relaxations of COPs, it has been shown in the literature that other methods, such as dual simplex, barrier methods, and interior point methods, perform better than simplex on certain occasions [66, 99, 133]. Moreover, the worst-case running time of the simplex method is exponential [69], yet in practice it often outperforms polynomial-time interior point methods. Thus, finding an algorithm for linear programming that is polynomial both in theory and in practice is still an attractive area of research, which makes the method of solving an LP problem using an ANN worth studying. Another reason for selecting an LP model to demonstrate the implementation of ANNs is the simplicity of LP models: a reader familiar with the theory of LP will find it easy to understand the analog circuit of the LP problem.

Consider a standard form of a linear programming problem described as

$$\displaystyle\begin{array}{rl} \mathrm{minimize}: &{ \mathbf{c}}^{T}\mathbf{x} \\ \mathrm{subject\;to}: & A\mathbf{x} = \mathbf{b} \\ & 0 \leq x_{i} \leq x_{\mathrm{max}}\quad \forall i = 1,2,\ldots,n \end{array}$$
(3)

where \(\mathbf{c} \in {\mathbb{R}}^{n}\), \(\mathbf{b} \in {\mathbb{R}}^{m}\), \(A \in {\mathbb{R}}^{m\times n}\), and \(\mathbf{x} \in {\mathbb{R}}^{n}\). In order to apply an ANN to this problem, the above problem has to be transformed into an unconstrained optimization problem using either a penalty function or a Lagrange function method. For the sake of simple illustration, a penalty function approach is presented here. Let \(\mathbf{x}(t)\) represent the instantaneous (dynamic) value of the solution vector at time $t$, and let \(E(\mathbf{x}(t),t)\) represent the modified (penalized) objective function, given as

$$E(\mathbf{x}(t),t) = \frac{C_{1}} {2} {(A\mathbf{x}(t) -\mathbf{b})}^{T}(A\mathbf{x}(t) -\mathbf{b}) + C_{ 2}{\mathbf{c}}^{T}(\mathbf{\overline{x}} + \mathbf{x}(t){e}^{-C_{3}t})$$
(4)

where \(C_{1},C_{2}\), and \(C_{3}\) represent positive scaling parameters of the system. Let the system dynamics be defined as

$$\nabla _{t}\mathcal{I} = -\nabla _{\mathbf{x}(t)}E(\mathbf{x}(t),t)$$
(5)

where \(\mathcal{I}\) represents the vector of internal states (internal energies) of the neurons. Based on the system dynamics defined in Eq. (5), the system can be described as

$$\nabla _{t}\mathcal{I} = -C_{1}{A}^{T}A\mathbf{x}(t) + C_{ 1}{A}^{T}\mathbf{b} - C_{ 2}\mathbf{c}\;{e}^{-C_{3}t}$$
(6)

and

$$x_{i}(t) = \frac{x_{\mathrm{max}}} {1 + {e}^{(-C_{4}\mathcal{I}(t))}}$$
(7)

where \(C_{4}\) is a positive scaling parameter,

$$\nabla _{t}\mathcal{I} ={ \left (\frac{d\mathcal{I}_{1}} {dt}, \frac{d\mathcal{I}_{2}} {dt},\ldots, \frac{d\mathcal{I}_{n}} {dt} \right )}^{T},\quad \mathcal{I}\in {\mathbb{R}}^{n}$$

and

$$\nabla _{\mathbf{x}(t)}E(\mathbf{x}(t),t) ={ \left (\frac{\partial E(\mathbf{x}(t),t)} {\partial x_{1}}, \frac{\partial E(\mathbf{x}(t),t)} {\partial x_{2}},\ldots, \frac{\partial E(\mathbf{x}(t),t)} {\partial x_{n}} \right )}^{T}$$

Equations (6) and (7) are the main equations used to design the ANN. The first step in designing the neural network is to determine the number of neurons. Since each variable is represented by a neuron, altogether $n$ neurons are required for the construction of the ANN. Let \(\mathcal{I}_{i}\) represent the internal state and $x_{i}$ the external state of the \(i\mathrm{th}\) neuron; at any given time, both states of the \(i\mathrm{th}\) neuron can be represented by \([\mathcal{I}_{i}(t),x_{i}(t)]\). The next step is to connect the network using links. These links contain resistors, which realize the (synaptic) weights of the links. Based on Eq. (6), the weights are taken as the coefficients of the first-order terms, and the constant term in Eq. (6) is taken as the input bias of the neurons. From Eq. (6), the following weight matrix and bias vector are obtained:

$$W = -C_{1}{A}^{T}A\quad \mathrm{and}\quad \theta (t) = C_{ 1}{A}^{T}\mathbf{b} - C_{ 2}\mathbf{c}\;{e}^{-C_{3}t}$$
(8)

or

$$w_{ij} = -C_{1}\displaystyle\sum _{k=1}^{m}a_{ ki}a_{kj}\quad \mathrm{and}\quad \theta _{i}(t) = C_{1}\displaystyle\sum _{k=1}^{m}a_{ ki}b_{k} - C_{2}c_{i}\;{e}^{-C_{3}t}$$
(9)

where $a_{ij}$ is the element in the \(i\mathrm{th}\) row and \(j\mathrm{th}\) column of $A$, and $w_{ij}$ is the element in the \(i\mathrm{th}\) row and \(j\mathrm{th}\) column of $W$.

Now, in order to set the appropriate weights, resistors are used. The absolute value of the synaptic weight $w_{ij}$ (between the \(i\mathrm{th}\) and \(j\mathrm{th}\) neurons) is obtained as the ratio of the resistances $R_{f}$ and $R_{ij}$, that is,

$$\vert w_{ij}\vert = \frac{R_{f}} {R_{ij}}$$
(10)

The next step is to assign the input bias to each neuron. This can be done using the voltage $E_{i}$ and the resistor $R_{i}$:

$$\theta _{i} = \frac{R_{f}E_{i}} {R_{i}}$$
(11)

The other part of the input bias, that is, \(C_{2}c_{i}{e}^{-C_{3}t}\), is obtained by providing a discharging loop with an additional capacitor $C_{d}$ and resistor $R_{d}$. The configuration is shown in Fig. 4. The values of $R_{d}$ and $R_{f}$ are chosen arbitrarily such that \(R_{d} \ll R_{f}\). The initial value on $C_{d}$ is set to \(-C_{2}\;c_{i}\;\forall \;i\). Once these values are set, the remaining values are given as

$$\displaystyle\begin{array}{rcl} R_{ij} = \frac{R_{f}} {\vert w_{ij}\vert }& &\end{array}$$
(12)
$$\displaystyle\begin{array}{rcl} C_{3} \approx 1/(R_{d}C_{d})& &\end{array}$$
(13)
Fig. 4 Circuit representation of a neuron

The positive (or negative) sign of a weight is realized by using the $x_{i}$ terminal (or the $-x_{i}$ terminal) of the neuron. For the sake of simplicity, the neuron in Fig. 4 can be represented as depicted in Fig. 5. Once the structure of a single neuron is formed, it is connected to the other neurons to form the ANN, which can be represented as in Fig. 6. As time proceeds, the neurons update their internal and external energies based on the update rules and on the dynamics of selecting neurons for update. If the network is stable and the dynamics are appropriate, the network converges to a limit point (a local minimum), which gives the optimal solution of the LP [39]. Applying ANNs to solve LP problems has been studied and analyzed in [25–28, 35, 67, 100, 136].

Fig. 5 Block representation of a neuron

Fig. 6 Interconnected neurons
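The same dynamics can also be checked by a stepwise simulation. The following sketch integrates Eqs. (6) and (7) with a forward Euler step on a toy two-variable LP; the problem data and all parameter values ($C_1$–$C_4$, the step size, $x_{\mathrm{max}}$, the number of steps) are illustrative choices, not values prescribed in the text.

```python
import numpy as np

A = np.array([[1.0, 1.0]])           # one constraint: x1 + x2 = 1
b = np.array([1.0])
c = np.array([1.0, 2.0])             # minimize x1 + 2*x2; optimum at (1, 0)
C1, C2, C3, C4 = 50.0, 1.0, 1.0, 5.0
x_max, dt = 1.0, 1e-3

I = np.zeros(2)                      # internal states of the two neurons
for step in range(20000):
    t = step * dt
    x = x_max / (1.0 + np.exp(-C4 * I))                       # Eq. (7)
    dI = -C1 * A.T @ (A @ x - b) - C2 * c * np.exp(-C3 * t)   # Eq. (6)
    I += dt * dI                                              # Euler step

# Approximately feasible and biased toward the cheaper vertex (1, 0);
# how close it gets depends on the parameter choices above.
print(x_max / (1.0 + np.exp(-C4 * I)))
```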

The purpose of depicting the method of developing an analog circuit for solving an LP was to illustrate to the reader the various complexities involved in deploying an analog ANN for solving any COP. In the following subsections, some of the classical ANN models applied to solving COPs are described. These are the basic models proposed in the 1980s–1990s as alternative analog methods for solving COPs. They can be broadly classified as:

  • Gradient-based methods

    • Hopfield and Tank’s discrete and continuous models

  • Self-organizing methods

    • Durbin-Willshaw’s and Kohonen’s models

  • Projective gradient methods

    • Lagrange-based models

2.3 Hopfield and Tank (H-T) Models

In 1982, Hopfield [56] proposed a novel feedback-type neural network capable of performing computational tasks. Although feedback-type models had been proposed earlier by Anderson and Kohonen [10, 71], Hopfield [57] provided a complete mathematical analysis of feedback-type neural networks; hence, such networks came to be called Hopfield networks. Hopfield and Tank [58] proposed two different models for solving COPs. They are described below:

(a) H-T's Discrete Model

The discrete model consists of $n$ interconnected neurons. The strength of the connection between the \(i\mathrm{th}\) and the \(j\mathrm{th}\) neuron is represented by the weight $w_{ij}$; a nonzero weight indicates the presence of a connection. Weights may be positive or negative, representing the sense of the connection (excitatory or inhibitory). In this model, each neuron is assigned two states (an external and an internal state, as in the McCulloch and Pitts model [82]) representing the memory embedded at each neuron. The internal state \(\mathcal{I}_{i}\) of the \(i\mathrm{th}\) neuron is continuous valued, whereas the external state $x_{i}$ is binary valued. The internal and external states of a neuron are related as

$$\displaystyle\begin{array}{rcl} \mathcal{I}_{i}(t + 1) =\displaystyle\sum _{ j=1}^{n}w_{ ij}x_{j}(t) +\theta _{i}& &\end{array}$$
(14)
$$\displaystyle\begin{array}{rcl} x_{i}(t + 1) = \Psi _{d}(\mathcal{I}_{i}) = \left \{\begin{array}{ll} 1&\text{if }\mathcal{I}_{i} > U_{i} \\ 0&\text{if }\mathcal{I}_{i} \leq U_{i} \end{array} \right.& &\end{array}$$
(15)

where \(\theta _{i}\) is a constant representing an external input bias to the \(i\mathrm{th}\) neuron, \(\Psi _{d}()\) is a transfer function that assigns binary values to $x_{i}$, and $U_{i}$ is the threshold value of the \(i\mathrm{th}\) neuron. If the aggregate input supplied to the \(i\mathrm{th}\) neuron is greater than $U_{i}$, then the neuron will "fire," that is, $x_{i}$ is set to 1. Otherwise, the neuron will be in a dormant or "not-firing" state. For solving a COP, the $w_{ij}$'s are calculated from the objective function and the constraints of the given COP. The only decision variables of this model are $x_{i}$ and \(\mathcal{I}_{i}\). Unless otherwise specified, the values of the $U_{i}$'s are taken as 0.

Initially, a random value is assigned to each external state, and the neurons update their states based on the updating rules. Usually, the neurons are selected for update at random (asynchronous updating). This updating rule provably converges to one of the stable states [83], which minimizes the energy function given by Eq. (16):

$$E_{d} = -\frac{1} {2}\displaystyle\sum _{i=1}^{n}\displaystyle\sum _{ j=1}^{n}\mathrm{w}_{ ij}x_{i}x_{j} -\displaystyle\sum _{i=1}^{n}\theta _{ i}x_{i}$$
(16)

Since one of the states is binary and the update sequence of neurons is random, this model is known as a discrete model with stochastic updates. With the above details, Algorithm 1 summarizes the mechanism of this model.

Algorithm 1 (figure)
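A minimal Python sketch of this mechanism (a random initial state, asynchronous updates via Eqs. (14) and (15), stopping at a stable state) is given below; the two-neuron toy network is an illustrative choice, not an example from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def ht_discrete(W, theta, U=None, n_sweeps=100):
    """Asynchronous (stochastic) updates of Eqs. (14)-(15) until no
    neuron changes its state during a full sweep."""
    n = len(theta)
    U = np.zeros(n) if U is None else U
    x = rng.integers(0, 2, size=n).astype(float)   # random initial state
    for _ in range(n_sweeps):
        changed = False
        for i in rng.permutation(n):               # random update order
            I_i = W[i] @ x + theta[i]              # Eq. (14)
            x_new = 1.0 if I_i > U[i] else 0.0     # Eq. (15)
            changed = changed or (x_new != x[i])
            x[i] = x_new
        if not changed:                            # stable state reached
            break
    return x

# Two mutually inhibiting neurons with positive bias: every local
# minimum of the energy in Eq. (16) has exactly one neuron firing.
W = np.array([[0.0, -2.0], [-2.0, 0.0]])
theta = np.array([1.0, 1.0])
print(ht_discrete(W, theta))   # either [1, 0] or [0, 1]
```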

This model can be used to compute a local minimum of any quadratic function. Hopfield and Tank [58, 121] showed that, by the way the energy function is constructed, every update decreases the energy, so the algorithm always reaches one of the stable states. The algorithm proceeds as a gradient descent method, converging to a local minimum.

The main step in solving a COP using H-T's discrete model is to convert the constrained COP into an unconstrained one using a penalty function method (or the Lagrange method). Once the unconstrained penalized objective function is obtained, it is compared with the energy function given in Eq. (16).

(b) H-T's Continuous Model

In subsequent research [57], H-T proposed another model in which both states of the neuron are continuous. In addition, two properties are assigned to each neuron, namely the input capacitance $C_{i}$ and the transmembrane resistance $R_{i}$. Moreover, a finite impedance $R_{ij}$ between the \(i\mathrm{th}\) neuron and the output $x_{j}$ of the \(j\mathrm{th}\) neuron is assigned to the link $ij$. The main difference from the discrete model is that in the continuous model the external states of the neurons take continuous values between 0 and 1. The equations of motion of this model are

$$\displaystyle\begin{array}{rcl} \frac{\,d\mathcal{I}_{i}} {\,dt} =\displaystyle\sum _{ j=1}^{n}\mathrm{w}_{ ij}x_{j}(t) -\frac{\mathcal{I}_{i}} {\tau } +\theta _{i}& &\end{array}$$
(17)
$$\displaystyle\begin{array}{rcl} x_{i} = \Psi _{c}(\mathcal{I}_{i})& &\end{array}$$
(18)

where \(\tau = R_{i}C_{i}\); usually $\tau$ is assigned a value of 1 if the time step of any discrete-time simulation is less than unity. A widely used transfer function is the continuous sigmoidal function \(\Psi _{c}(\mathcal{I}_{i}) = \frac{1} {2}\left(1 +\tanh \left(\frac{\mathcal{I}_{i}} {T}\right) \right)\), whose slope is controlled by the parameter $T$. Similar to the previous model, this model converges to a stable state that minimizes the energy function given by Eq. (19):

$$E_{c} = -\frac{1} {2}\displaystyle\sum _{i=1}^{n}\displaystyle\sum _{ j=1}^{n}\mathrm{w}_{ ij}x_{i}x_{j} -\displaystyle\sum _{i=1}^{n}\theta _{ i}x_{i} +\displaystyle\int _{ 0}^{x_{i} }\Psi _{c}^{-1}(a)\,da$$
(19)

The only decision variables in this model are $x_{i}$ and \(\mathcal{I}_{i}\). Algorithm 2 describes the mechanism of this model. From this illustration, it can be seen that a continuous ANN can be applied to solve discrete optimization problems just by tuning the parameters of the transfer function. Specifically, if $T$ is set to a very small value, the transfer function becomes very steep and this model behaves approximately like the discrete model.

Algorithm 2 (figure)
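A corresponding sketch of the continuous mechanism, integrating Eqs. (17) and (18) with a forward Euler step, is shown below on the same toy network; the step size and the values of $T$ are illustrative. Note how a small $T$ yields nearly binary outputs, in line with the remark above.

```python
import numpy as np

def psi_c(I, T):
    """Sigmoidal transfer function of the continuous model."""
    return 0.5 * (1.0 + np.tanh(I / T))

def ht_continuous(W, theta, T, tau=1.0, dt=0.01, n_steps=5000):
    """Forward Euler integration of the equations of motion, Eqs. (17)-(18)."""
    I = 0.01 * np.random.default_rng(1).standard_normal(len(theta))
    for _ in range(n_steps):
        x = psi_c(I, T)
        I += dt * (W @ x - I / tau + theta)
    return psi_c(I, T)

W = np.array([[0.0, -2.0], [-2.0, 0.0]])   # same toy network as before
theta = np.array([1.0, 1.0])
print(ht_continuous(W, theta, T=1.0))      # soft, intermediate outputs
print(ht_continuous(W, theta, T=0.01))     # nearly binary: ~[1, 0] or ~[0, 1]
```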

The first attempt to solve a COP using an ANN was demonstrated by Hopfield and Tank [58] in 1985, who illustrated the approach by solving sufficiently large instances of the famous traveling salesman problem (TSP). After this illustration, numerous approaches and extensions were proposed for solving various COPs using ANNs, mostly in the direction of improving H-T's model. In the following subsections, alternative approaches that differ significantly from H-T's ANN model are presented.

2.4 Durbin-Willshaw (D-W)’s Model

In 1987, Durbin and Willshaw [34] proposed an approach fundamentally different from H-T's for solving geometric COPs. The proposed method was named elastic nets and is often referred to as deformable templates. The origin of this method is the seminal work of Willshaw and von der Malsburg [130]. In this model, the neurons have no memory or states, and the network is not connected according to weight values; instead, it is connected in the form of a ring (an elastic ring), with the neurons placed on its circumference. Since this model is used to solve geometric COPs, it is described here with respect to the TSP.

In this model, there are \(\mathit{M}\) neurons on the circumference of the ring, where \(\mathit{M}\) is greater than \(\mathit{N}\), the number of cities. The first step is to find the center of gravity of the cities; this point is taken as the center of the ring, and the neurons on the ring are then iteratively pulled outward. In every iteration, each neuron is pulled so as to reduce the distance between itself and the closest city, while an opposing force from the ring tries to keep the length of the ring as small as possible. Thus, in the end, when each city is close enough to some neuron, a Hamiltonian tour is obtained. The equation describing the movement of the \(j\mathrm{th}\) neuron is

$$\Delta y_{j} =\alpha \displaystyle\sum _{ i=1}^{N}w_{ ij}(a_{i} - y_{j}) +\beta K(y_{j+1} + y_{j-1} - 2y_{j})$$
(20)

where $a_{i}$ is the coordinate vector of the \(i\mathrm{th}\) city of the TSP, and \(\alpha,\;\beta,\;\) and $K$ are scaling parameters:

$$w_{ij} = \frac{\phi (d_{a_{i}y_{j}},K)} {\displaystyle\sum\nolimits_{k=1}^{M} \phi (d_{a_{i}y_{k}},K)}\quad \forall i,\,j$$
(21)

\(d_{a_{i}y_{j}}\) is the Euclidean distance between the \(i\mathrm{th}\) city and the \(j\mathrm{th}\) neuron; and

$$\phi (d_{a_{i}y_{j}},K) = {e}^{-d_{a_{i}y_{j}}^{2} /2{K}^{2} }$$
(22)

\(\alpha\) is the scaling factor for the force that drives the neurons toward the cities, whereas \(\beta\) is the scaling factor for the force that keeps neighboring neurons on the ring close together. The parameter $K$ is gradually decreased so that the neurons move closer to the cities; it acts like the temperature in simulated annealing and requires a cooling schedule. The energy of this system is given by Eq. (23):

$$E_{dw} = -\alpha K\displaystyle\sum _{i=1}^{N}\ln \displaystyle\sum _{ j=1}^{M}\phi (d_{ a_{i}y_{j}},K) + \frac{\beta } {2}\displaystyle\sum _{j=1}^{M}d_{ y_{j}y_{j+1}}^{2}$$
(23)

The mechanism of this model can be explained by Algorithm 3.

Algorithm 3 (figure)
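One plausible reading of the elastic net iteration, applying Eqs. (20)–(22) with a geometric cooling schedule for $K$, is sketched below; all numerical parameters ($\alpha$, $\beta$, $M$, the initial $K$, and the cooling factor) are illustrative choices.

```python
import numpy as np

def elastic_net_step(y, cities, K, alpha=0.2, beta=2.0):
    """One synchronous application of Eq. (20) to all ring neurons y (M x 2)."""
    diff = cities[:, None, :] - y[None, :, :]                  # N x M x 2
    phi = np.exp(-(diff ** 2).sum(-1) / (2.0 * K * K))         # Eq. (22)
    w = phi / phi.sum(axis=1, keepdims=True)                   # Eq. (21)
    pull = alpha * (w[:, :, None] * diff).sum(0)               # toward cities
    tension = beta * K * (np.roll(y, -1, 0) + np.roll(y, 1, 0) - 2.0 * y)
    return y + pull + tension

rng = np.random.default_rng(0)
cities = rng.random((10, 2))                 # N = 10 random cities
M = 25                                       # M > N neurons on the ring
angles = 2.0 * np.pi * np.arange(M) / M
y = cities.mean(0) + 0.1 * np.c_[np.cos(angles), np.sin(angles)]

K = 0.2
while K > 0.01:                              # geometric cooling schedule
    for _ in range(25):
        y = elastic_net_step(y, cities, K)
    K *= 0.9                                 # each city ends near some neuron
```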

2.5 Kohonen Model

Another approach fundamentally different from H-T's was proposed by Kohonen [72] for solving geometric COPs. This model utilizes self-organizing maps (SOMs) [70, 71], which fall under the category of competitive neural networks. In this model, there are two layers of neurons (input and output), with the input layer fully connected to the output layer. In general, the neurons of the input layer represent input from a high-dimensional space, whereas the neurons of the output layer represent output in a low-dimensional space; the whole model therefore acts as a mapping from a higher-dimensional space to a lower-dimensional one. The property of self-organizing maps is that points close in the input space are mapped to neurons that are close in the output space. The output layer is arranged according to a topological structure based on the geometry of the problem under consideration; for the TSP, for example, the output layer is arranged in a ring. Let \({a}^{i} ={ (a_{1}^{i},a_{2}^{i},\ldots,a_{n}^{i})}^{T}\quad \forall i = 1,\ldots,N\) be the coordinates of the \(i\mathrm{th}\) city to be visited, and let \({w}^{j} ={ (w_{1j},w_{2j},\ldots,w_{Nj})}^{T},\quad \forall j = 1,\ldots,M,\;M \geq N\) represent the weight vector of the \(j\mathrm{th}\) output neuron. Unlike in the previous models, here the weights are the decision variables. They are changed over time using the following equation:

$${w}^{j} = {w}^{j} +\mu f_{ j,{j}^{{\ast}}}({a}^{i} - {w}^{j}),\quad \forall j = 1,2\ldots,M,\quad 0 <\mu < 1$$
(24)

where \(\mu\) is called the learning rate, \({j}^{{\ast}}\) is the winning neuron, defined as \({j}^{{\ast}} =\mathrm{ argmin}_{\forall \;j}\{d_{{a}^{i}w_{j}}\}\) for a given city $i$, and the function \(f : (j,{j}^{{\ast}})\mapsto [0,1]\) is a decreasing function of the lateral distance between the output units $j$ and \(j^{\ast}\). This function acts like a weight of attraction between two points and is typically modified as the algorithm proceeds so as to gradually reduce the magnitude of the attraction. This method can be described by Algorithm 4; a code sketch of it is given at the end of this subsection.

Thus, this iterative approach in the end produces a Hamiltonian tour. Since both Durbin-Willshaw's and Kohonen's models are intended for geometric problems, these models have not been widely used in the application of ANNs to COPs in general.
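As referenced above, a sketch of the Kohonen update of Eq. (24) on a ring of output neurons is given below for a toy TSP instance; the Gaussian neighborhood function and the decay schedules for $\mu$ and the neighborhood width are common but illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
cities = rng.random((10, 2))            # N = 10 city coordinates a^i
M = 30                                  # M >= N output neurons on a ring
w = np.tile(cities.mean(0), (M, 1)) + 0.01 * rng.standard_normal((M, 2))

mu, sigma = 0.8, M / 4.0                # learning rate, neighborhood width
for epoch in range(200):
    for a in cities[rng.permutation(len(cities))]:
        j_star = np.argmin(((w - a) ** 2).sum(1))          # winning neuron
        d = np.abs(np.arange(M) - j_star)
        d = np.minimum(d, M - d)                           # lateral ring distance
        f = np.exp(-(d ** 2) / (2.0 * sigma ** 2))         # decreasing f(j, j*)
        w += mu * f[:, None] * (a - w)                     # Eq. (24)
    mu *= 0.99                          # gradually weaken the attraction
    sigma = max(0.5, sigma * 0.97)

# Order the cities by the index of their nearest ring neuron -> a tour.
tour = np.argsort([np.argmin(((w - a) ** 2).sum(1)) for a in cities])
print(tour)
```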

2.6 Lagrangian Model

Another approach, based on Lagrange programming, was presented by Zhang and Constantinides [139]. The main difference of this method compared with the Hopfield and Tank [58] approach is that it does not perform plain gradient descent: the constraints and the objective function are not coupled using a penalty function. Instead, two types of neurons are used (variable neurons, $x$, and Lagrangian neurons, \(\lambda\)). One type (Lagrangian) seeks a feasible solution, while the other type (variable) seeks an optimal solution.

Algorithm 4 (figure)

The main advantage of this model over the previous ones is that the objective function is not tied to the Lyapunov-function criterion and need not be quadratic; that is, any type of function can be optimized, unlike the quadratic requirement of H-T's model [58]. For example, consider a general optimization problem defined as

$$\displaystyle\begin{array}{rl} \mathrm{minimize}: & f(\mathbf{x}) \\ \mathrm{subject\;to}: & h_{i}(\mathbf{x}) = 0\quad \forall i = 1,\ldots,m \\ & \mathbf{x}\; \in \; {\mathbb{R}}^{n} \end{array}$$
(25)

where \(f(\mathbf{x}),\;h_{i}(\mathbf{x}) : {\mathbb{R}}^{n}\mapsto \mathbb{R}\) are continuous and twice differentiable functions of \(\mathbf{x}\). The Lagrangian of problem (25) is written as

$$L(\mathbf{x},\boldsymbol{\lambda }) = f(\mathbf{x}) +{ \boldsymbol{\lambda }}^{T}h(\mathbf{x})$$
(26)

The equations of motion describing this model perform gradient descent in \(\mathbf{x}\) and gradient ascent in \(\boldsymbol{\lambda }\):

$$\displaystyle\begin{array}{rcl} \nabla _{t}\mathbf{x} = -\nabla _{x}L(\mathbf{x},\boldsymbol{\lambda })& &\end{array}$$
(27)
$$\displaystyle\begin{array}{rcl} \nabla _{t}\boldsymbol{\lambda } = +\nabla _{\lambda }L(\mathbf{x},\boldsymbol{\lambda })& &\end{array}$$
(28)

where

$$\displaystyle\begin{array}{rcl} \nabla _{t}\mathbf{x}& =&{ \left (\frac{dx_{1}} {dt}, \frac{dx_{2}} {dt},\ldots, \frac{dx_{n}} {dt} \right )}^{T},\quad x \in {\mathbb{R}}^{n} \\ \nabla _{t}\boldsymbol{\lambda }& =&{ \left (\frac{d\lambda _{1}} {dt}, \frac{d\lambda _{2}} {dt},\ldots, \frac{d\lambda _{m}} {dt} \right )}^{T},\quad \lambda \in {\mathbb{R}}^{m} \\ \nabla _{x}L(\mathbf{x},\boldsymbol{\lambda })& =&{ \left (\frac{\partial L(\mathbf{x},\boldsymbol{\lambda })} {\partial x_{1}}, \frac{\partial L(\mathbf{x},\boldsymbol{\lambda })} {\partial x_{2}},\ldots, \frac{\partial L(\mathbf{x},\boldsymbol{\lambda })} {\partial x_{n}} \right )}^{T} \\ \end{array}$$

and

$$\nabla _{\lambda }L(\mathbf{x},\boldsymbol{\lambda }) ={ \left (\frac{\partial L(\mathbf{x},\boldsymbol{\lambda })} {\partial \lambda _{1}}, \frac{\partial L(\mathbf{x},\boldsymbol{\lambda })} {\partial \lambda _{2}},\ldots, \frac{\partial L(\mathbf{x},\boldsymbol{\lambda })} {\partial \lambda _{m}} \right )}^{T}$$

The above model is iterative, as described in Algorithm 5, similar to H-T's continuous model. Both types of neurons are decision variables.

Algorithm 5 (figure)
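A minimal sketch of these dynamics, integrating Eqs. (27) and (28) with a forward Euler step, is shown below on a toy equality-constrained problem; the problem data, the step size, and the number of steps are illustrative choices.

```python
import numpy as np

# Toy problem: minimize (x1-1)^2 + (x2-2)^2  subject to  x1 + x2 - 1 = 0.
# Analytic solution: x* = (0, 1) with lambda* = 2.
def grad_f(x):  return 2.0 * (x - np.array([1.0, 2.0]))
def h(x):       return np.array([x[0] + x[1] - 1.0])
def grad_h(x):  return np.array([[1.0, 1.0]])      # Jacobian of h (m x n)

x, lam = np.zeros(2), np.zeros(1)
dt = 0.01
for _ in range(20000):
    dx   = -(grad_f(x) + grad_h(x).T @ lam)        # Eq. (27): descent in x
    dlam = h(x)                                    # Eq. (28): ascent in lambda
    x   += dt * dx
    lam += dt * dlam

print(x, lam)    # approaches (0, 1) and 2
```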

The mathematical properties and proofs related to convergence of ANN models will be discussed in Sect. 3 of this chapter.

2.7 General Methodology

In the previous subsections, a brief picture of different models that can be used to solve a COP was presented along with their algorithmic structure. In the remainder of this section, general guidelines for mapping COPs onto these ANN models are presented.

2.7.1 Guidelines for H-T’s Model (Discrete and Continuous)

  • The output state of the neurons represents the decision variables of the COP.

  • Combine the objective function and the constraints using penalty functions such that the overall modified objective function is quadratic; this modified function is called the energy function of the system.

  • Obtain the weights and the bias for each neuron. This is the main step that maps the COP onto the ANN. This can be done by either of the following methods:

    • Compare the coefficients of the terms in the energy function with the coefficients in the $E_{d}$ or $E_{c}$ function. This comparison determines the weights on the links and the input bias of each neuron.

    • Compare the coefficients of degree 1 in the differential update rule to the weights. Similarly, compare the constant term of differential update rule to the bias.

  • Assign a random initial state to all the neurons, and proceed as described in Algorithm 1 or Algorithm 2.

Note: Only models with a linear and/or quadratic objective function can be solved using the classical H-T model. For solving general COPs, various extensions have been proposed, such as the Lagrange function method.

2.7.2 Guidelines for Durbin-Willshaw’s Model

  • The decision variables of the COP are represented by the Euclidean positions of the neurons.

  • Find the center of gravity of the given geometry and mark it as the ring’s center.

  • Initially, all the neurons are placed uniformly on the circumference of the ring.

  • Update the position of the neurons as described in Algorithm 3.

2.7.3 Guidelines for Kohonen’s Model

  • The decision variables of the COP are represented by the weight vector.

  • Initially, all the cities of the input layer are connected with all the neurons of output layer.

  • Update the weights of the neurons as described in Algorithm 4.

Note that only COPs that have a low-dimensional geometric structure can be solved using Durbin-Willshaw's or Kohonen's model. However, extensions have been proposed for solving other COPs, such as self-organizing neural networks (SONNs) [108]. SONNs exploit the fact that solutions of many optimization problems can be seen as finding the best arrangement in a permutation matrix, the best arrangement being one that is both feasible and optimal for the given COP.

2.7.4 Guidelines for Lagrangian Model

  • The output state of the neurons (both Lagrangian and variable neurons) represents the decision variable of the COP.

  • Combine the objective function and constraints using the Lagrange method. This overall Lagrange function is called the energy function of the system.

  • Obtain the weights and the bias. This is the main step that maps the COP onto the ANN. This can be done by either of the following methods:

    • Compare the coefficients of the terms in the energy function with the coefficients in the \(L(\mathbf{x},\boldsymbol{\lambda })\) function. This comparison determines the weights on the links and the input bias of each neuron.

    • Compare the coefficients of degree 1 in the differential update rule to the weights. Similarly, compare the constant term of differential update rule to the bias.

  • Assign a random initial state to all neurons and proceed as described in Algorithm 5.

In the following subsection, a detailed mapping of a COP using H-T's model is presented.

2.8 Example: Mapping

In this subsection, an illustration of mapping a COP onto H-T's model is presented. The symmetric traveling salesman problem (TSP) is selected for the illustration because it has long served as a standard representative test problem for COPs.

2.8.1 Hopfield Mapping

Consider the TSP formulated as

$$\displaystyle\begin{array}{rcl} \mathrm{minimize} :& & \\ & \displaystyle\sum _{i=1}^{N}\displaystyle\sum _{k=1,k\neq i}^{N}\displaystyle\sum _{j=1}^{N}d_{ik}x_{ij}(x_{k,j+1} + x_{k,j-1})&\end{array}$$
(29)
$$\displaystyle\begin{array}{rcl} \mathrm{subject\;to} :& & \\ \displaystyle\sum _{i=1}^{N}x_{ ij}& = 1\quad \forall j = 1,\ldots,N&\end{array}$$
(30)
$$\displaystyle\begin{array}{rcl} \displaystyle\sum _{j=1}^{N}x_{ ij}& = 1\quad \forall i = 1,\ldots,N&\end{array}$$
(31)
$$\displaystyle\begin{array}{rcl} x_{ij}& \in \{0,1\}\quad \forall i,j = 1,\ldots,N&\end{array}$$
(32)

where

$$x_{ij} = \left \{\begin{array}{ll} 1&\mbox{ if city }i\mbox{ is in position }j\\ 0 &\mbox{ otherwise} \end{array} \right.$$

and $d_{ik}$ represents the distance between cities $i$ and $k$. The first step in mapping the TSP onto H-T's model is to convert the given problem into an unconstrained quadratic programming problem using penalty functions. The following transformations were used by Hopfield and Tank to map the problem:

  • The constraint in Eq. (30) is transformed into the penalty term

    $$\displaystyle\sum _{j=1}^{N}\displaystyle\sum _{i=1}^{N}\displaystyle\sum _{ k=1,k\neq i}^{N}x_{ ij}x_{kj}$$
  • The constraint in Eq. (31) is transformed into the penalty term

    $$\displaystyle\sum _{i=1}^{N}\displaystyle\sum _{j=1}^{N}\displaystyle\sum _{ k=1,k\neq j}^{N}x_{ ij}x_{ik}$$
  • In order to enforce the right-hand sides of 1 in Eqs. (30) and (31) (i.e., exactly $N$ variables equal to 1 in total), an additional penalty is added:

    $${\left (\displaystyle\sum _{i=1}^{N}\displaystyle\sum _{ j=1}^{N}x_{ ij} - N\right )}^{2}$$

Thus, the penalized objective function can be written as

$$\displaystyle\begin{array}{rcl} E_{\mathrm{tsp}}& =& \frac{A} {2} \displaystyle\sum _{j=1}^{N}\displaystyle\sum _{ i=1}^{N}\displaystyle\sum _{ k=1,k\neq i}^{N}x_{ ij}x_{kj} + \frac{B} {2} \displaystyle\sum _{i=1}^{N}\displaystyle\sum _{ j=1}^{N}\displaystyle\sum _{ k=1,k\neq j}^{N}x_{ ij}x_{ik} \\ & & +\frac{C} {2}{ \left (\displaystyle\sum _{i=1}^{N}\displaystyle\sum _{ j=1}^{N}x_{ ij} - N\right )}^{2} + \frac{D} {2} \displaystyle\sum _{i=1}^{N}\displaystyle\sum _{ k=1,k\neq i}^{N}\displaystyle\sum _{ j=1}^{N}d_{ ik}x_{ij}(x_{k,j+1} + x_{k,j-1})\end{array}$$
(33)

Comparing Eq. (33) with the energy function in Eq. (16), the following values for the synaptic weights and the input bias are obtained:

$$\displaystyle\begin{array}{rcl} \mathrm{w}_{ijkl} = -A\delta _{ik}(1 -\delta _{jl}) - B\delta _{jl}(1 -\delta _{ik}) - C - Dd_{ik}(\delta _{l(j+1)} +\delta _{l(j-1)})& &\end{array}$$
(34)
$$\displaystyle\begin{array}{rcl} \theta _{ij} = CN& &\end{array}$$
(35)

where \(\delta _{ik}\) is the Kronecker delta. With these values of the weights and the bias, H-T's energy function can be written as

$$E_{\mathrm{tsp}} = -\frac{1} {2}\displaystyle\sum _{i,j,k,l=1}^{N}\mathrm{w}_{ ijkl}x_{ij}x_{kl} -\displaystyle\sum _{i,j=1}^{N}\theta _{ ij}x_{ij}$$
(36)

Thus, the system represented by Eqs. (34)–(36) can be implemented via an analog circuit or by stepwise simulation on a digital computer. Although this is not the best mapping for the TSP (see [119] for a better one), its purpose is to illustrate the various modifications needed to map even a simple COP onto a given ANN. A small sketch of constructing the resulting weights and biases is given below.
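The construction of Eqs. (34) and (35) can be sketched directly. In the illustrative snippet below, the penalty values $A$, $B$, $C$, $D$ are arbitrary choices, and positions are taken modulo $N$ (position $N+1$ coincides with position 1, since the tour is cyclic):

```python
import numpy as np

def tsp_weights(d, A=500.0, B=500.0, C=200.0, D=500.0):
    """Weight tensor of Eq. (34) and bias of Eq. (35) for distance matrix d."""
    N = len(d)
    delta = np.eye(N)
    w = np.zeros((N, N, N, N))       # w[i, j, k, l]: neuron (i,j) <-> (k,l)
    for i in range(N):
        for j in range(N):
            for k in range(N):
                for l in range(N):
                    w[i, j, k, l] = (-A * delta[i, k] * (1 - delta[j, l])
                                     - B * delta[j, l] * (1 - delta[i, k])
                                     - C
                                     - D * d[i, k] * (delta[l, (j + 1) % N]
                                                      + delta[l, (j - 1) % N]))
    theta = C * N * np.ones((N, N))  # Eq. (35)
    return w, theta

d = np.array([[0.0, 1.0, 2.0], [1.0, 0.0, 1.0], [2.0, 1.0, 0.0]])
w, theta = tsp_weights(d)            # feed these into the discrete model
```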

In this section, some classical ANN models applied to solving COPs were reviewed. However, the conditions under which ANNs converge to an optimal solution were not addressed. In Sect. 3, conditions that guarantee the stability and convergence of ANNs when solving COPs are presented.

3 Optimality Conditions

In the previous sections, it was stated that ANNs perform a gradient descent method, but no mathematical analysis or proof was provided to support this claim. In this section, the stability and convergence analysis of ANNs is presented. The objective of such analysis is to prevent the network from oscillation or chaos and to guarantee its convergence to a local minimum [83, 106], that is, to determine under what conditions the ANN converges from an arbitrary initial point to a locally stable point. Although the term stability is used frequently from the electrical engineering point of view, from the optimization point of view these conditions are more naturally called optimality conditions. Therefore, in the following subsections, the necessary and sufficient conditions under which ANNs converge to local minima are analyzed.

3.1 System Dynamics

The method of selecting neurons for update in a given ANN is called its system dynamics. In the previous section, a random method of selecting a neuron for update was mentioned. It is natural to ask why a neuron should be selected randomly for update rather than in a synchronous, parallel, or consecutive way. This subsection investigates the available methods of selecting neurons. Before the different update rules are analyzed, the different dynamics that can be applied to an ANN are described for a discrete state, discrete dynamics system.

Consider a general form of a discrete-time ANN with the following dynamic equations:

$$\displaystyle\begin{array}{rcl} \mathcal{I}_{i}(t + 1) =\displaystyle\sum _{ j=1}^{n}\mathrm{w}_{ ij}x_{j}(t) +\theta _{i}& &\end{array}$$
(37)
$$\displaystyle\begin{array}{rcl} x_{i}(t + 1) = \Psi _{d}(\mathcal{I}_{i}) = \left \{\begin{array}{ll} 1 &\text{if }\mathcal{I}_{i} > U_{i} \\ - 1&\text{if }\mathcal{I}_{i} \leq U_{i} \end{array} \right.& &\end{array}$$
(38)

where $x_{i}$ and \(\mathcal{I}_{i}\) represent the external and internal states of the \(i\mathrm{th}\) neuron and $w_{ij}$ represents the weight between the \(i\mathrm{th}\) and the \(j\mathrm{th}\) neuron. \(\theta _{i}\) is a constant representing an external input to the \(i\mathrm{th}\) neuron. The function \(\Psi _{d}()\) is called the transfer function; it assigns binary values to $x_{i}$. It should be noted that \(\Psi _{d}()\) is slightly different from the conventional sign (signum) function, owing to the simple observation that a single dormant (not-firing) neuron can never excite itself (since $w_{ii} = 0$). $U_{i}$ represents the threshold value of the \(i\mathrm{th}\) neuron. If the aggregate input supplied to the \(i\mathrm{th}\) neuron is greater than $U_{i}$, then the neuron will "fire," that is, $x_{i}$ is set to 1; otherwise, the neuron will be in a dormant or "not-firing" state. Note that here the external states of the neurons take the values $-1$ or $1$. The dynamic equation of the discrete state system can be written compactly as

$$x_{i}(t + 1) = \Psi _{d}\left (\displaystyle\sum _{j=1}^{n}\mathrm{w}_{ ij}x_{j}(t) +\theta _{i}\right )$$
(39)

The three main types of dynamics that can be implemented in any ANN are described as:

  • Synchronous Dynamics

    $$x_{i}(t + 1) = \Psi _{d}\left (\displaystyle\sum _{j=1}^{n}\mathrm{w}_{ ij}x_{j}(t) +\theta _{i}\right )\quad \forall \;i = 1,\ldots,n$$
    (40)
  • Parallel Dynamics

    $$x_{i}(t+1) = \left \{\begin{array}{ll} \Psi _{d}\left (\displaystyle\sum _{j=1}^{n}\mathrm{w}_{ ij}x_{j}(t) +\theta _{i}\right )&\forall \;i\; \in \; S(t) \\ x_{i}(t) &\mathrm{otherwise}\end{array} \right.$$
    (41)
  • Consecutive Dynamics

    $$x_{i}(t+1) = \left \{\begin{array}{ll} \Psi _{d}\left (\displaystyle\sum _{j=1}^{n}\mathrm{w}_{ ij}x_{j}(t) +\theta _{i}\right )&i = K \\ x_{i}(t) &\mathrm{otherwise}\end{array} \right.$$
    (42)

where $S(t)$ represents the set of selected neurons at time $t$ and $K$ represents the single selected neuron at time $t$. The above equations are defined for a discrete state, discrete dynamics system. If a system is updated in discrete time, it is called a discrete time system; similarly, if its (external) state is discrete, it is called a discrete state system. Correspondingly, a system updated in continuous time is a continuous time system, and one with a continuous state is a continuous state system. Thus, in general, all ANNs can be classified as discrete state discrete dynamics (DSDD), continuous state discrete dynamics (CSDD), continuous state continuous dynamics (CSCD), and discrete state continuous dynamics (DSCD). In the following subsections, the stability and optimality analysis of the different systems is presented.

3.2 Lyapunov Energy Function

When Hopfield introduced the notion of a feedback-type ANN, a Lyapunov function was used to analyze the convergence characteristics of the H-T model, in order to prove that H-T's models are stable and capable of finding solutions to optimization problems. Given the importance of Lyapunov analysis of ANNs for COPs, some of the main results from the literature are presented in this subsection. Before proceeding with the Lyapunov analysis, some definitions are stated.

Basic Definitions:

Consider a dynamic system of the following form:

$$\nabla _{t}\mathbf{x}(t) = E(\mathbf{x}(t)),\quad \mathbf{x}(t_{0}) = \mathbf{x}_{0}\; \in \; {\mathbb{R}}^{n}$$
(43)

Definition 1

A point \({\mathbf{x}}^{{\ast}}\) is called the equilibrium point (or critical point, or steady state point) of the system described by Eq. (43) if \(E({\mathbf{x}}^{{\ast}}) = 0\).

Definition 2

A function \(f : {\mathbb{R}}^{n}\mapsto \mathbb{R}\) is said to be a Lyapunov function if it holds the following properties:

  • Function f is continuous and differentiable.

  • The partial derivatives of function f w.r.t. any element of \(\mathbf{x}\) are continuous.

  • Function f is nonnegative over its entire domain.

  • The derivative of function f w.r.t. time is nonpositive.

Definition 3

For the system described by Eq. (43), an equilibrium point \({\mathbf{x}}^{{\ast}}\) is said to be Lyapunov stable (or stable in the sense of Lyapunov) if for any positive scalar \(\epsilon\), there exists a positive scalar \(\delta\) such that

$$\vert \vert \mathbf{x}(t_{0}) -{\mathbf{x}}^{{\ast}}\vert \vert <\delta \quad \rightarrow \quad \vert \vert \mathbf{x}(t) -{\mathbf{x}}^{{\ast}}\vert \vert < \epsilon \quad \forall t \geq t_{ 0}$$

Definition 4

An equilibrium point \({\mathbf{x}}^{{\ast}}\) is said to be asymptotically stable (or asymptotically stable in the sense of Lyapunov) if the following conditions are satisfied:

  • \({\mathbf{x}}^{{\ast}}\) is stable.

  • \(\lim _{t\rightarrow \infty }\mathbf{x}(t) ={ \mathbf{x}}^{{\ast}}\).

For the detailed explanation of the above definitions, see [76].

Analysis of DSDD Systems

Let us begin with systems of type DSDD. Two widely known Lyapunov functions for DSDD systems can be written as

  • Lyapunov 1

    $$ E_{\mathrm{DD1}}(t + 1) = -\left < \mathbf{x}(t + 1),W\mathbf{x}(t)\right > -\left < \boldsymbol{\theta },(\mathbf{x}(t + 1) + \mathbf{x}(t))\right > $$
    (44)

    or

    $$\displaystyle\begin{array}{rcl} E_{\mathrm{DD1}}(t + 1)& =& -\displaystyle\sum _{i=1}^{n}\displaystyle\sum _{ j=1}^{n}\mathrm{w}_{ ij}x_{i}(t + 1)x_{j}(t) \\ & -& \displaystyle\sum _{i=1}^{n}\theta _{ i}(x_{i}(t + 1) + x_{i}(t)) \end{array}$$
    (45)
  • Lyapunov 2

    $$ E_{\mathrm{DD2}}(t + 1) = -\frac{1} {2}\left < \mathbf{x}(t + 1),W\mathbf{x}(t + 1)\right > -\left < \boldsymbol{\theta },\mathbf{x}(t + 1)\right > $$
    (46)

    or

    $$E_{\mathrm{DD2}}(t + 1) = -\frac{1} {2}\displaystyle\sum _{i=1}^{n}\displaystyle\sum _{ j=1}^{n}\mathrm{w}_{ ij}x_{i}(t + 1)x_{j}(t + 1) -\displaystyle\sum _{ i=1}^{n}\theta _{ i}x_{i}(t + 1)$$
    (47)

Theorem 1

If W is symmetric and the update rule is synchronous, then a discrete system with the Lyapunov function defined by Eq. (45) will converge either to a local minimum or to a two-edge cycle.

Proof

The change brought to the energy function given by Eq. (45) in one iteration is

$$\displaystyle\begin{array}{rcl} \bigtriangleup E_{\mathrm{DD1}}(t + 1)& =& E_{\mathrm{DD1}}(t + 1) - E_{\mathrm{DD1}}(t) \\ & =& \left (-\displaystyle\sum _{i=1}^{n}\displaystyle\sum _{ j=1}^{n}\mathrm{w}_{ ij}x_{i}(t + 1)x_{j}(t) -\displaystyle\sum _{ i=1}^{n}\theta _{ i}(x_{i}(t + 1) + x_{i}(t))\right ) \\ & & -\left (-\displaystyle\sum _{i=1}^{n}\displaystyle\sum _{ j=1}^{n}\mathrm{w}_{ ij}x_{i}(t)x_{j}(t - 1) -\displaystyle\sum _{ i=1}^{n}\theta _{ i}(x_{i}(t) + x_{i}(t - 1))\right ) \\ & & \end{array}$$
(48)

Simplifying Eq. (48) leads to the following:

$$\displaystyle\begin{array}{rcl} \bigtriangleup E_{\mathrm{DD1}}(t + 1)& =& -\left (\displaystyle\sum _{i=1}^{n}\theta _{ i}(x_{i}(t + 1) - x_{i}(t - 1))\right ) \\ & & +\left (-\displaystyle\sum _{i=1}^{n}\displaystyle\sum _{ j=1}^{n}\mathrm{w}_{ ij}x_{i}(t + 1)x_{j}(t) +\displaystyle\sum _{ i=1}^{n}\displaystyle\sum _{ j=1}^{n}\mathrm{w}_{ ij}x_{i}(t)x_{j}(t - 1)\right ) \\ & & \end{array}$$
(49)

Changing the summation index in the second term and using the symmetry of $W$, the following simplification is obtained:

$$\displaystyle\begin{array}{rcl} \bigtriangleup E_{\mathrm{DD1}}(t + 1)& =& -\left (\displaystyle\sum _{i=1}^{n}\theta _{ i}(x_{i}(t + 1) - x_{i}(t - 1))\right ) \\ & & -\displaystyle\sum _{i=1}^{n}\displaystyle\sum _{ j=1}^{n}\mathrm{w}_{ ij}x_{j}(t)(x_{i}(t + 1) - x_{i}(t - 1))\end{array}$$
(50)

Rewriting the above equation,

$$\displaystyle\begin{array}{rcl} \bigtriangleup E_{\mathrm{DD1}}(t + 1)& =& -\displaystyle\sum _{i=1}^{n}(x_{ i}(t + 1) - x_{i}(t - 1))\left (\displaystyle\sum _{j=1}^{n}\mathrm{w}_{ ij}x_{j}(t) +\theta _{i}\right )\end{array}$$
(51)

For the \(i\mathrm{th}\) neuron, the contribution to the change in the energy function is given by

$$\displaystyle\begin{array}{rcl} (\bigtriangleup E_{\mathrm{DD1}}(t + 1))_{i} = -(x_{i}(t + 1) - x_{i}(t - 1))\left (\displaystyle\sum _{j=1}^{n}\mathrm{w}_{ ij}x_{j}(t) +\theta _{i}\right )& &\end{array}$$
(52)

or

$$\displaystyle\begin{array}{rcl} (\bigtriangleup E_{\mathrm{DD1}}(t + 1))_{i} = (x_{i}(t - 1) - x_{i}(t + 1))\;\mathcal{I}_{i}(t + 1)& &\end{array}$$
(53)

Thus, for the Lyapunov energy function given by Eq. (45), the change in energy contributed by the \(i\mathrm{th}\) neuron is given by Eq. (53).

Since the update rule is synchronous, at every iteration any of the following scenarios may arise:

  • If \(\mathcal{I}_{i}(t + 1) > 0 \Rightarrow x_{i}(t + 1) = 1 \Rightarrow (\bigtriangleup E_{\mathrm{DD1}}(t + 1))_{i} \leq 0.\)

  • If \(\mathcal{I}_{i}(t + 1) \leq 0 \Rightarrow x_{i}(t + 1) = -1 \Rightarrow (\bigtriangleup E_{\mathrm{DD1}}(t + 1))_{i} \leq 0.\)

From the above implications, \((\bigtriangleup E_{\mathrm{DD1}}(t + 1))_{i} \leq 0\) for every neuron $i$. Thus, the energy is nonincreasing; since the state space is finite, the energy is bounded below and the system converges. At the limit, \((\bigtriangleup E_{\mathrm{DD1}}(t + 1))_{i} = 0\), which implies one of the following cases:

  • \(\mathcal{I}_{i}(t + 1) = 0\), in which case the contribution of the \(i\mathrm{th}\) neuron vanishes trivially and the case can be discarded.

  • \(x_{i}(t - 1) = x_{i}(t + 1)\quad \forall \;i\), with two subcases:

    • \(x_{i}(t - 1) = x_{i}(t + 1)\neq x_{i}(t)\) for some \(i\quad \Rightarrow \) a two-edge cycle.

    • \(x_{i}(t - 1) = x_{i}(t + 1) = x_{i}(t)\quad \forall \;i\quad \Rightarrow \) a local minimum.

Thus, the system will converge to a local minimum or to a two-edge cycle.
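The two-edge cycle is easy to exhibit numerically. In the illustrative snippet below, two mutually inhibiting neurons updated synchronously flip together forever:

```python
import numpy as np

W = np.array([[0.0, -1.0], [-1.0, 0.0]])   # symmetric, mutual inhibition
theta = np.zeros(2)

x = np.array([1.0, 1.0])
for t in range(6):
    I = W @ x + theta
    x = np.where(I > 0, 1.0, -1.0)         # all neurons updated at once
    print(t, x)                            # alternates (-1,-1) <-> (1,1)
```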

The next case to consider is the DSDD system with parallel dynamics. If a similar energy function is used, then by a similar argument it is observed that, for parallel dynamics, if \(i\; \in \; S(t - 1)\) and \(i\;\notin \;S(t)\), the sign of the contribution in Eq. (53) is undetermined. Thus, the energy of the system could increase under parallel dynamics. However, this case does not arise if the set of neurons for the parallel update is selected in a specific way and a different energy function is used. The following theorem gives the necessary and sufficient condition for the convergence of parallel dynamics.

Theorem 2

If W is a symmetric matrix with zero diagonal and the update rule is parallel, then a discrete system with the Lyapunov function defined by Eq. (47) will converge to a local minimum if and only if S(t) contains an independent set of neurons at each iteration t, where S(t) is said to be independent if \(w_{ij} = 0\) for all \(i,\,j\; \in \; S(t)\).

Proof

Consider the generalized Lyapunov energy function given by Eq. (47), and let $S(t)$ be an independent set of neurons. The change in the energy of the system is given by

$$\displaystyle\begin{array}{rcl} \bigtriangleup E_{\mathrm{DD2}}(t + 1)& =& E_{\mathrm{DD2}}(t + 1) - E_{\mathrm{DD2}}(t) \\ & =& \left (-\frac{1} {2}\displaystyle\sum _{i=1}^{n}\displaystyle\sum _{ j=1}^{n}\mathrm{w}_{ ij}x_{i}(t + 1)x_{j}(t + 1) -\displaystyle\sum _{ i=1}^{n}\theta _{ i}x_{i}(t + 1)\right ) \\ & & -\left (-\frac{1} {2}\displaystyle\sum _{i=1}^{n}\displaystyle\sum _{ j=1}^{n}\mathrm{w}_{ ij}x_{i}(t)x_{j}(t) -\displaystyle\sum _{ i=1}^{n}\theta _{ i}x_{i}(t)\right ) \end{array}$$
(54)

Since only the neurons in $S(t)$ change their states, $W$ is symmetric with zero diagonal, and \(\mathrm{w}_{ij} = 0\) for all \(i,j\; \in \; S(t)\), the quadratic cross terms within $S(t)$ vanish and the change reduces to

$$\bigtriangleup E_{\mathrm{DD2}}(t + 1) =\displaystyle\sum _{i\in S(t)}(x_{i}(t) - x_{i}(t + 1))\;\mathcal{I}_{i}(t + 1)$$
(55)

Now, based on the above equation, one of the following two scenarios occurs for each \(k\; \in \; S(t)\):

  • If \(\mathcal{I}_{k}(t + 1) > 0 \Rightarrow x_{k}(t + 1) = 1 \Rightarrow \) the contribution of neuron $k$ to \(\bigtriangleup E_{\mathrm{DD2}}(t + 1)\) is nonpositive.

  • If \(\mathcal{I}_{k}(t + 1) \leq 0 \Rightarrow x_{k}(t + 1) = -1 \Rightarrow \) the contribution of neuron $k$ to \(\bigtriangleup E_{\mathrm{DD2}}(t + 1)\) is nonpositive.

Thus, the above system is converging. At the limit, \(\bigtriangleup E_{\mathrm{DD2}}(t + 1) = 0\), which implies one of the following cases:

  • \(\mathcal{I}_{k}(t + 1) = 0\) for some \(k\; \in \; S(t)\), in which case the contribution of that neuron vanishes trivially and the case can be discarded.

  • \(x_{k}(t) = x_{k}(t + 1)\quad \forall \;k\quad \Rightarrow \) a local minimum.

Thus, the system will converge to a local minimum. For the proof of the converse, see [83].

Theorem 3

If W is symmetric and the update rule is consecutive, then a discrete system with the Lyapunov function defined by Eq. (47) will converge to a local minimum.

Proof

Consider the Lyapunov energy function given by Eq. (47). The change in energy is given by

$$\displaystyle\begin{array}{rcl} \bigtriangleup E_{\mathrm{DDC}}(t + 1)& =& (E_{\mathrm{DDC}}(t + 1)) - (E_{\mathrm{DDC}}(t)) \\ & =& \frac{1} {2}\left (-\displaystyle\sum _{i=1}^{n}\displaystyle\sum _{ j=1}^{n}\mathrm{w}_{ ij}x_{i}(t + 1)x_{j}(t + 1) +\displaystyle\sum _{ i=1}^{n}\theta _{ i}(x_{i}(t + 1))\right ) \\ & & -\frac{1} {2}\left (-\displaystyle\sum _{i=1}^{n}\displaystyle\sum _{ j=1}^{n}\mathrm{w}_{ ij}x_{i}(t)x_{j}(t) +\displaystyle\sum _{ i=1}^{n}\theta _{ i}(x_{i}(t))\right ) \end{array}$$
(56)

Splitting the above equation for the cases when i = K and when \(i\neq K\), and using the symmetry of W, we get

$$\bigtriangleup E_{\mathrm{DDC}}(t + 1) = (x_{k}(t) - x_{k}(t + 1))\;\mathcal{I}_{k}(t)$$
(57)

The following two scenarios may happen:

  • If \(\mathcal{I}_{k}(t) > 0 \Rightarrow x_{k}(t + 1) = 1 \Rightarrow \bigtriangleup E_{\mathrm{DDC}}(t + 1) \leq 0\)

  • If \(\mathcal{I}_{k}(t) \leq 0 \Rightarrow x_{k}(t + 1) = 0 \Rightarrow \bigtriangleup E_{\mathrm{DDC}}(t + 1) \leq 0\)

Thus, the above system converges. At the convergence limit, \(\bigtriangleup E_{\mathrm{DDC}}(t + 1) = 0\), which implies one of the following cases:

  • \(\mathcal{I}_{k}(t) = 0 \Rightarrow x_{k}(t) = x_{k}(t + 1) = 0\quad \forall \;k\; \in \; N\), a trivial case that can be discarded.

  • \(x_{k}(t) - x_{k}(t + 1) = 0\quad \forall \;k\; \in \; N\quad \Rightarrow \) local minimum.

Thus, the system will converge to a local minimum.

Theorem 4

If W is symmetric positive semidefinite, then a discrete system defined by Eq. ( 47 ) will converge to a local minimum.

Proof

See [83].

The above analysis was performed for the DSDD system. Another important class is the CSDD system. Similar to the discrete state case, three main types of dynamics can be applied to the continuous state system; they can be described as

  • Synchronous Dynamics

    $$x_{i}(t + 1) = \Psi _{c}\left (\displaystyle\sum _{j=1}^{n}\mathrm{w}_{ ij}x_{j}(t) +\theta _{i}\right )\quad \forall \;i = 1,\ldots,n$$
    (58)
  • Parallel Dynamics

    $$x_{i}(t+1) = \left \{\begin{array}{ll} \Psi _{c}\left (\displaystyle\sum _{j=1}^{n}\mathrm{w}_{ ij}x_{j}(t) +\theta _{i}\right )&\forall \;i\; \in \; S(t) \\ x_{i}(t) &\mathrm{otherwise}\end{array} \right.$$
    (59)
  • Consecutive Dynamics

    $$x_{i}(t+1) = \left \{\begin{array}{ll} \Psi _{c}\left (\displaystyle\sum _{j=1}^{n}\mathrm{w}_{ ij}x_{j}(t) +\theta _{i}\right )&i = K \\ x_{i}(t) &\mathrm{otherwise}\end{array} \right.$$
    (60)

where S(t) and K have the same meaning as in the DSDD case. The main difference between the above equations and the discrete state equations is the definition of the transfer function. Here, the transfer function is no longer the sign function; instead, it can be any real-valued, monotonically increasing continuous function with the following properties:

$$\displaystyle\begin{array}{rcl} \Psi _{c}(a)\; \rightarrow \; 1\quad \mathrm{as}\quad a\; \rightarrow \;\infty & & \\ \end{array}$$
$$\displaystyle\begin{array}{rcl} \Psi _{c}(a)\; \rightarrow \;-1\quad \mathrm{as}\quad a\; \rightarrow \;-\infty & &\end{array}$$
(61)

Apart from the above properties, the following property is also incorporated to allow an easy shift from a discrete state to a continuous state neuron. This is one of the important properties of the transfer function, as it adds the flexibility of using a continuous state model for solving a discrete COP.

$$\Psi _{c}(\mu a)\; \rightarrow \; \Psi _{d}(a)\quad as\;\mu \; \rightarrow \;\infty \;$$
(62)

Even the monotonically increasing criterion is not necessary; to be specific, any monotonically nondecreasing continuous function with the above properties may be used as the transfer function. However, a monotonically increasing continuous function is used here for the sake of simplicity and clarity [83].

There are many standard mathematical functions that appropriately fit the above restrictions on the transfer function. The following are some of the widely used transfer functions (a short numerical sketch follows the list):

  1. \(\Psi _{c}(a)\, =\,\tanh (a)\)

  2. \(\Psi _{c}(a)\, =\, \frac{2} {\pi } {\tan }^{-1}(a)\)

  3. \(\Psi _{c}(a)\, =\, \frac{1-{e}^{-a}} {1+{e}^{-a}}\)

  4. \(\Psi _{c}(a)\, =\,\coth (a) -\frac{1} {a}\)
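To make the above properties concrete, the following is a minimal sketch, assuming NumPy (the function names and sample values are ours, not from the source), of the four transfer functions together with a check of the sharpening property of Eq. (62): as the gain \(\mu \) grows, \(\Psi _{c}(\mu a)\) approaches the sign function \(\Psi _{d}(a)\).

```python
import numpy as np

def psi_tanh(a):
    return np.tanh(a)

def psi_arctan(a):
    return (2.0 / np.pi) * np.arctan(a)

def psi_logistic(a):
    # (1 - e^{-a}) / (1 + e^{-a}); algebraically equal to tanh(a / 2)
    return (1.0 - np.exp(-a)) / (1.0 + np.exp(-a))

def psi_langevin(a):
    # coth(a) - 1/a (the Langevin function); its limiting value at a = 0 is 0
    a = np.asarray(a, dtype=float)
    safe = np.where(a == 0.0, 1.0, a)
    return np.where(a == 0.0, 0.0, 1.0 / np.tanh(safe) - 1.0 / safe)

# Sharpening property of Eq. (62): Psi_c(mu * a) -> sign(a) as mu -> infinity
a = np.array([-2.0, -0.5, 0.5, 2.0])
for mu in (1.0, 10.0, 100.0):
    print(mu, np.round(psi_tanh(mu * a), 3))   # rows approach [-1, -1, 1, 1]
```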

In the following part of this section, some more theorems without proof for the CSDD systems are presented. Interested readers are directed to [83] for the detailed proofs of the following theorems.

Theorem 5

If  W is symmetric, \(w_{ii} \geq 0\), and the update rule is synchronous, then a CSDD system will converge to a local minimum or a two-edge cycle, if the following energy function is used.

$$\displaystyle\begin{array}{rcl} E_{\mathrm{CDS}}(t + 1)& =& -\displaystyle\sum _{i=1}^{n}\displaystyle\sum _{j=1}^{n}\mathrm{w}_{ij}x_{i}(t + 1)x_{j}(t) +\displaystyle\sum _{i=1}^{n}\theta _{i}\left (x_{i}(t + 1) + x_{i}(t)\right ) \\ & & +\displaystyle\sum _{i=1}^{n}\left (\displaystyle\int _{0}^{x_{i}(t+1)}{\Psi _{c}}_{i}^{-1}(a)\,da +\displaystyle\int _{0}^{x_{i}(t)}{\Psi _{c}}_{i}^{-1}(a)\,da\right ) \\ & & \end{array}$$
(63)

Theorem 6

If W is symmetric and positive semidefinite, then a CSDD system will converge to a local minimum (irrespective of the type of dynamics used), if the following energy function is used.

$$\displaystyle\begin{array}{rcl} E_{\mathrm{CD}}(t + 1)& =& -\frac{1} {2}\displaystyle\sum _{i=1}^{n}\displaystyle\sum _{ j=1}^{n}\mathrm{w}_{ ij}x_{i}(t + 1)x_{j}(t) \\ & & +\displaystyle\sum _{i=1}^{n}\theta _{ i}(x_{i}(t + 1)) +\displaystyle\sum _{ i=1}^{n}\displaystyle\int _{ 0}^{x_{i}(t+1)}{{\Psi _{ c}}}_{i}^{-1}(a)\,da\end{array}$$
(64)

The restriction of a positive semidefinite W can be relaxed if \(\Psi _{c}\) and \({\Psi _{c}}^{-1}\) are differentiable. Let \(\zeta _{i}\) be a scalar defined by

$${\Psi _{c}}_{i}^{{\prime}}\;\leq \;\zeta _{ i}\quad \mathrm{and}\quad {[{\Psi _{c}}^{-1}]}^{{\prime}}\;\geq \; \frac{1} {\zeta _{i}}$$
(65)

If such \(\zeta _{i}\)’s exist, then the following theorem holds:

Theorem 7

Let \(\widehat{w}_{ij}\; =\; w_{ij} + \frac{\delta _{ij}} {\zeta _{i}}\), where \(\delta _{ij}\) is the Kronecker delta. Let the energy function be defined as

$$\displaystyle\begin{array}{rcl} E_{\mathrm{CD}}(t + 1)& =& -\frac{1} {2}\displaystyle\sum _{i=1}^{n}\displaystyle\sum _{ j=1}^{n}\mathrm{w}_{ ij}x_{i}(t + 1)x_{j}(t) \\ & & +\displaystyle\sum _{i=1}^{n}\theta _{ i}(x_{i}(t + 1)) +\displaystyle\sum _{ i=1}^{n}\displaystyle\int _{ 0}^{x_{i}(t+1)}{{\Psi _{ c}}}_{i}^{-1}(a)\,da\end{array}$$
(66)

If \(\widehat{W}\) is positive definite, then a CSDD system will converge to a local minimum (irrespective of the type of dynamics used).

Furthermore, for the specific case of consecutive dynamics, a more relaxed optimality condition can be used.

Theorem 8

Let the energy function be defined as

$$\displaystyle\begin{array}{rcl} E_{\mathrm{CD}}(t + 1)& =& -\frac{1} {2}\displaystyle\sum _{i=1}^{n}\displaystyle\sum _{ j=1}^{n}\mathrm{w}_{ ij}x_{i}(t + 1)x_{j}(t) \\ & & +\displaystyle\sum _{i=1}^{n}\theta _{ i}(x_{i}(t + 1)) +\displaystyle\sum _{ i=1}^{n}\displaystyle\int _{ 0}^{x_{i}(t+1)}{\Psi _{ c}}_{i}^{-1}(a)\,da\end{array}$$
(67)

If \(w_{ii} \geq 0\) , \(w_{ij} + \frac{1} {\zeta _{i}} > 0\), the update rule is consecutive, and \(\Psi _{c}\) and \({\Psi _{c}}^{-1}\) are differentiable, then a CSDD system will converge to a local minimum.

A similar analysis can be done for continuous dynamics; interested readers are referred to [83] for the details. The above theorems can be easily extended to {0, 1} systems. Continuous dynamics are usually hard to simulate on digital computers and are mostly used in designing analog circuits; the typical way of implementing continuous dynamics on a digital computer is to discretize time into small steps. The objective of presenting the proofs for discrete state systems is to highlight to the reader that the convergence of ANNs is a critical issue. Moreover, the convergence of ANNs depends not only on the energy function but also on the system's dynamics. However, it can also be seen from the above proofs that a carefully designed energy function (based on various properties of the weight matrix W) will drive the system to a local minimum, irrespective of the dynamics.

Lyapunov function analysis shows that ANNs converge to local optima only under specific scenarios. It is useful for those optimization problems whose constraints and objective function can be converted into a quadratic form, with the constraints incorporated into the Lyapunov function using an appropriate penalty function. In general, however, it is hard to obtain a Lyapunov energy function for a given system. This does not mean that ANNs cannot be applied to such systems; rather, different types of energy functions are designed for them.

The primary approach that replaced Lyapunov energy function analysis is a general penalty function analysis, in which the constraints of a given optimization problem are incorporated into the energy function [77]. In the following subsections, the convergence and stability criteria for penalty- and Lagrange-based ANNs will be presented.

3.3 Penalty-Based Energy Functions

Consider the following constrained optimization problem:

$$\displaystyle\begin{array}{rcl} \mathrm{minimize} :& & \\ & f(\mathbf{x}) & \\ \mathrm{subject\;to} :& & \\ g_{i}(\mathbf{x})& \geq 0\quad \forall i = 1,\ldots,m& \\ \end{array}$$
$$\displaystyle\begin{array}{rcl} \mathbf{x}& \in {\mathbb{R}}^{n}&\end{array}$$
(68)

where \(f,g_{i} : {\mathbb{R}}^{n}\mapsto \mathbb{R}\), and both f and g i are assumed to be continuously differentiable functions. The above problem can be converted to an unconstrained optimization problem as

$$\displaystyle\begin{array}{rcl} \mathrm{minimize} :& & \\ & E_{\mathrm{pen}}(\mathbf{x})& \\ \mathrm{subject\;to} :& & \\ \end{array}$$
$$\displaystyle\begin{array}{rcl} \mathbf{x}& \in {\mathbb{R}}^{n}&\end{array}$$
(69)

where

$$E_{\mathrm{pen}}(\mathbf{x}) = f(\mathbf{x}) +\displaystyle\sum _{ i=1}^{m}c_{ i}P_{i}(\mathbf{x})$$
(70)

and \(P_{i}(\mathbf{x})\) is the penalty function. The convergence of ANNs for the above problem depends upon the structure of the penalty function. In this work, a convergence analysis based on the penalty function \(P_{i}(\mathbf{x}) = \frac{1} {2}{[\mathrm{min}\{0,g_{i}(\mathbf{x})\}]}^{2}\;\forall \,i\, =\, 1,\ldots,m\) will be presented. One useful property of this penalty function is that it can be converted to a continuously differentiable penalty function [77]. Consider the following transformation:

Let

$$\nu _{i} = \left \{\begin{array}{ll} c_{i}\,g_{i}(\mathbf{x})&\text{if }g_{i}(\mathbf{x}) < 0 \\ 0 &\text{if }g_{i}(\mathbf{x}) \geq 0 \end{array} \right.\quad \forall \,i\, =\, 1,\ldots,m$$
(71)

then,

$$c_{i}\,P_{i}(\mathbf{x}) = \frac{c_{i}} {2} {[\mathrm{min}\{0,g_{i}(\mathbf{x})\}]}^{2} = \frac{1} {2}\nu _{i}\,g_{i}(\mathbf{x})\quad \forall \,i\, =\, 1,\ldots,m$$
(72)

Since \(g_{i},\;\forall \,i\, =\, 1,\ldots,m,\) is assumed to be differentiable, the above penalty function is differentiable.

The dynamics of this systemFootnote 12 are defined as

$$\frac{dx_{i}} {dt} = - \frac{1} {C_{i}} \frac{\partial E(\mathbf{x})} {\partial x_{i}} \quad \quad \forall \,i\, =\, 1,\ldots,n$$
(73)

Thus, these systems are also called gradient-based systems. These systems [77] have several useful properties that can be exploited to solve the COPs. These properties are described in the following theorems:

Theorem 9

Let the system be described by Eq. ( 73 ). An isolated equilibrium point of the system is asymptotically stable if and only if it is a strict local minimizer of the function \(E_{\mathrm{pen}}\).

Theorem 10

Let the system be described by Eq. ( 73 ). If an isolated equilibrium point of the system is stable, then it is a local minimizer of the function \(E_{\mathrm{pen}}\).

For gradient-based systems, assuming Lipschitz continuity, Han et al. [51] established several other convergence properties.
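As an illustration, the following is a minimal sketch (the toy problem, penalty weight, and Euler step are our own choices, not from the source) of the gradient system of Eqs. (70) and (73) with the quadratic penalty \(P(\mathbf{x}) = \frac{1} {2}{[\mathrm{min}\{0,g(\mathbf{x})\}]}^{2}\), integrated by explicit Euler with C = I.

```python
import numpy as np

# Toy problem: minimize (x1 - 2)^2 + (x2 - 2)^2  s.t.  g(x) = 1 - x1 - x2 >= 0;
# the constrained optimum is (0.5, 0.5).
c, dt = 100.0, 1e-3                 # penalty weight c_i and Euler step

def grad_f(x):
    return 2.0 * (x - 2.0)

def g(x):
    return 1.0 - x[0] - x[1]

grad_g = np.array([-1.0, -1.0])     # gradient of the (linear) constraint

x = np.zeros(2)
for _ in range(100_000):
    # dE/dx = grad f + c * min{0, g(x)} * grad g,  from P = 0.5 * min{0, g}^2
    dE = grad_f(x) + c * min(0.0, g(x)) * grad_g
    x -= dt * dE                    # Eq. (73): dx/dt = -(1/C) dE/dx with C = I

print(x, g(x))  # roughly (0.51, 0.51); the small infeasibility shrinks as c grows
```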

One of the difficulties in applying the penalty-based method in ANNs is the selection of the various penalty parameters. Due to the physical limits of analog circuits, c i cannot be set to a very large value. In order to overcome the difficulties of the penalty method, and to generalize the use of ANNs, the Lagrange-based energy function was proposed [139]. The following subsection discusses some of the results for Lagrange-based ANNs.

3.4 Lagrange Energy Functions

Lagrange-based neural networks are inspired by saddle point theory and duality analysis. The focus of these networks is to search for a point that satisfies the KKT first-order necessary conditions of optimality [12]. Analogous to the gradient descent systems, these networks are also known as projective gradient systems.

In the following subsections, necessary and sufficient conditions related to Lagrange type energy functions will be presented. Consider the following optimization problem:

$$\displaystyle\begin{array}{rcl} \mathrm{minimize} :& & \\ & f(\mathbf{x})& \\ \mathrm{subject\;to} :& & \\ \end{array}$$
$$\displaystyle\begin{array}{rcl} h_{i}(\mathbf{x})& = 0\quad \forall i = 1,\ldots,m&\end{array}$$
(74)
$$\displaystyle\begin{array}{rcl} \mathbf{x}& \in {\mathbb{R}}^{n}& \\ \end{array}$$

where \(f,h_{i} : {\mathbb{R}}^{n}\mapsto \mathbb{R}\), and both f and h i are assumed to be continuously differentiable functions. The above problem can be converted to an unconstrained optimization problem using the Lagrange function:

$$L(\mathbf{x},\boldsymbol{\lambda }) = f(\mathbf{x}) +{ \boldsymbol{\lambda }}^{T}h(\mathbf{x})$$
(75)

In this subsection, some of the basic optimality conditions are stated as theorems without proof. Interested readers may refer to [12] for the complete discussion, with proofs.

Theorem 11

Let us assume that \({\mathbf{x}}^{{\ast}}\) is a regular point (one that satisfies constraint qualification). If \({\mathbf{x}}^{{\ast}}\) is the local minimum of problem Eq. ( 74 ), then there exists \({\boldsymbol{\lambda }}^{{\ast}} \in {\mathbb{R}}^{m}\) such that:

$$\nabla _{x}L({\mathbf{x}}^{{\ast}},{\boldsymbol{\lambda }}^{{\ast}}) = 0$$
(76)

and

$${ \mathbf{d}}^{T}\nabla _{ xx}^{2}L({\mathbf{x}}^{{\ast}},{\boldsymbol{\lambda }}^{{\ast}})\mathbf{d} \geq 0,\quad \forall \;\mathbf{d}\; \in \{\mathbf{d} : \nabla _{ x}h{({\mathbf{x}}^{{\ast}})}^{T}\mathbf{d} = 0\}$$
(77)

Theorem 12

Let us assume that \(h({\mathbf{x}}^{{\ast}}) = 0\). If \(\;\exists \;{\boldsymbol{\lambda }}^{{\ast}}\in {\mathbb{R}}^{m}\) such that:

$$\nabla _{x}L({\mathbf{x}}^{{\ast}},{\boldsymbol{\lambda }}^{{\ast}}) = 0$$
(78)

and

$${ \mathbf{d}}^{T}\nabla _{ xx}^{2}L({\mathbf{x}}^{{\ast}},{\boldsymbol{\lambda }}^{{\ast}})\mathbf{d} > 0,\quad \forall \;\mathbf{d}\; \in \{\mathbf{d} : \mathbf{d}\neq 0;\nabla _{ x}h{({\mathbf{x}}^{{\ast}})}^{T}\mathbf{d} = 0\}$$
(79)

then \({\mathbf{x}}^{{\ast}}\) is the local minimum of problem Eq. ( 74 ).

Similar to the case of gradient-based dynamics, Lagrange systems have the following dynamic equations:

$$\displaystyle\begin{array}{rcl} \nabla _{t}\mathbf{x}& = -\nabla _{x}L(\mathbf{x},\boldsymbol{\lambda })&\end{array}$$
(80)
$$\displaystyle\begin{array}{rcl} \nabla _{t}\boldsymbol{\lambda }& = \nabla _{\lambda }L(\mathbf{x},\boldsymbol{\lambda })&\end{array}$$
(81)

Although the above equations are similar to those of the gradient-based systems, due to the presence of two different types of variables (ordinary variables and Lagrange variables), the system is called a projective gradient system. With the two types of variables, the Lagrangian decreases w.r.t. one type of variable and increases w.r.t. the other; this leads to the saddle point theory. For example, it is easy to see that

$$ \displaystyle \begin{array}{c} \frac{dL(\mathbf{x},\boldsymbol{\lambda })} {dt} \vert_{\boldsymbol{\lambda }=\text{constant}} =\displaystyle\sum_{i=1}^{n}\frac{\partial L(\mathbf{x},\boldsymbol{\lambda })} {\partial x_{i}} \frac{dx_{i}} {dt} = -\displaystyle\sum_{i=1}^{n} {\left (\frac{dx_{i}} {dt} \right )}^{2} \leq 0 \end{array}$$
(82)
$$ \displaystyle\begin{array}{l} \frac{dL(\mathbf{x},\boldsymbol{\lambda })} {dt} {\vert}_{\mathbf{x}= \text{constant}} = \displaystyle\sum_{i=1}^{m}\frac{\partial L(\mathbf{x},\boldsymbol{\lambda})} {\partial \lambda _{i}} \frac{d\lambda _{i}} {dt} = \displaystyle\sum_{i=1}^{m}{\left(\frac{d\lambda_{i}} {dt}\right)}^{2} \geq 0 \end{array} $$
(83)

Thus, \(({\mathbf{x}}^{{\ast}},{\boldsymbol{\lambda }}^{{\ast}})\) acts as a saddle point of the system.
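The saddle behavior of Eqs. (82) and (83) can be checked numerically. The following sketch (a toy example of our own, not from the source) integrates Eqs. (80) and (81) by explicit Euler for \(f(x) = {x}^{2}\) and \(h(x) = x - 1\), so that \(L(x,\lambda ) = {x}^{2} +\lambda (x - 1)\) with the Lagrange solution \({x}^{{\ast}} = 1\), \({\lambda }^{{\ast}} = -2\).

```python
dt = 1e-3
x, lam = 0.0, 0.0
for _ in range(50_000):
    dx = -(2.0 * x + lam)    # dx/dt   = -dL/dx    (Eq. 80): descent in x
    dlam = x - 1.0           # dlam/dt = +dL/dlam  (Eq. 81): ascent in lambda
    x, lam = x + dt * dx, lam + dt * dlam

print(x, lam)                # converges toward x* = 1.0, lambda* = -2.0
```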

Theorem 13

If \(\nabla _{xx}^{2}L(\mathbf{x},\boldsymbol{\lambda })\) is positive definite \(\forall \;\mathbf{x}\; \in \; {\mathbb{R}}^{n}\;and\;\forall \;\boldsymbol{\lambda }\;\in \; {\mathbb{R}}^{m}\;\), then the proposed ANN is Lyapunov stable.

The proof of this theorem is presented by Zhang and Constantinides in [139]. From the above theorem, the following corollaries can be obtained:

Corollary 1

\(\mathbf{x}\) tends to a limit \({\mathbf{x}}^{{\ast}}\) at which \(\nabla _{x}L({\mathbf{x}}^{{\ast}},\boldsymbol{\lambda }) = 0\).

Corollary 2

If \(\nabla h_{i}({\mathbf{x}}^{{\ast}})\;\forall \;i\, =\, 1,\ldots,m\) are linearly independent, then \(\boldsymbol{\lambda }\) converges to \({\boldsymbol{\lambda }}^{{\ast}}\).

Theorem 14

Let the network be described by Eqs. ( 75 ), ( 80 ), and ( 81 ). If the network is physically stable, then the ANN converges to a local minimum. The point of convergence is a Lagrange solution to problem Eq. ( 74 ).

The above restriction of global convexity can be relaxed to local convexity, yielding the following theorem:

Theorem 15

Let the network be described by Eqs. ( 75 ), ( 80 ), and ( 81 ). If the following conditions hold:

  1. \(\exists ({\mathbf{x}}^{{\ast}},{\boldsymbol{\lambda }}^{{\ast}})\) such that \(({\mathbf{x}}^{{\ast}},{\boldsymbol{\lambda }}^{{\ast}})\) is a stationary point

  2. \(\nabla _{xx}^{2}L({\mathbf{x}}^{{\ast}},{\boldsymbol{\lambda }}^{{\ast}}) > 0\)

  3. \({\mathbf{x}}^{{\ast}}\) is a regular point of problem Eq.(74)

and the initial point is in the proximity of \(({\mathbf{x}}^{{\ast}},{\boldsymbol{\lambda }}^{{\ast}})\), then the ANN will asymptotically converge to \(({\mathbf{x}}^{{\ast}},{\boldsymbol{\lambda }}^{{\ast}})\).

In this section, different ways to study the convergence and stability of ANNs were illustrated. Various theorems were presented to highlight the critical role of the energy function in the convergence behavior of ANNs. Convergence guarantees local optimality provided a suitable energy function and suitable dynamics are used to update the states of the neurons. For obtaining global optimality, numerous extensions have been proposed in the literature. In the following section, a discussion of the methods that can be used to escape from local minima, in the hope of finding the global minimum, is presented.

4 Escaping Local Minima

The very thought of a gradient descent method kindles the question of global optimality. Classical ANNs perform gradient descent and will rarely converge to the global minimum. In order to overcome this difficulty, various extensions have been proposed. The extension methods that may lead ANNs to global minima by escaping local minima can be broadly classified as:

  • Stochastic extensions

  • Chaotic extensions

  • Convexification

  • Hybridization

Clearly, any extension that can be applied to the H-T model can be easily applied to the general penalty- or Lagrange-based ANNs. Thus, in most cases, only the extensions of the H-T model will be reviewed. In the following part of this section, a brief discussion of the above extension methods is presented.

4.1 Stochastic Extensions

There are basically four ways to include stochasticity in the H-T’s model [107]. They are:

  1. Introducing stochasticity in the transfer function

  2. Introducing noise in the synaptic weights

  3. Introducing noise in the input bias

  4. Any combination of the above three

The prominent stochastic extensions that can be applied to the classical ANNs are discussed in the following parts of this subsection.

4.1.1 Boltzmann Machines

This is the first extension of H-T’s model, which proposes a way to escape from the local minima [1, 2, 54, 55]. This is done by replacing the deterministic transfer function with a probabilistic transfer function. Let the change in energy of the ith neuron of the H-T’s model be represented by \(\nabla E_{i}\). Then, irrespective of the previous state, the \(i\mathrm{th}\) neuron will fire with probability p i , which is defined as

$$p_{i} = \frac{1} {1 +\exp \left (-\nabla E_{i}/T\right )}$$
(84)

and

$$x_{i} = \Psi _{\mathrm{Blz}}(\mathcal{I}_{i}) = \left \{\begin{array}{ll} 1&\text{with probability }p_{i} \\ 0&\text{with probability }(1 - p_{i}) \end{array} \right.$$
(85)

where T is a parameter analogous to the temperature in simulated annealing.
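A minimal sketch of the Boltzmann update of Eqs. (84) and (85) follows; the energy change, temperatures, and random seed are illustrative values of our own. Note how the stochastic rule approaches the deterministic one as T → 0.

```python
import numpy as np

rng = np.random.default_rng(0)

def boltzmann_fire(dE, T):
    # Eq. (84): firing probability; Eq. (85): the stochastic transfer function
    p = 1.0 / (1.0 + np.exp(-dE / T))
    return 1 if rng.random() < p else 0

for T in (10.0, 1.0, 0.1):
    mean = np.mean([boltzmann_fire(dE=0.5, T=T) for _ in range(10_000)])
    print(T, mean)   # about 0.51, 0.62, 0.99: nearly deterministic as T -> 0
```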

4.1.2 Gaussian Machines

These are extensions of the Boltzmann machines, where noise is introduced in the inputs [7]. The updating dynamics of the \(i\mathrm{th}\) neuron in the H-T model are modified as

$$\mathcal{I}_{i} =\displaystyle\sum _{ j=1}^{n}w_{ ij}x_{j} +\theta _{i} + \epsilon _{i}$$
(86)

where \(\epsilon _{i}\) is the noise term, which follows a Gaussian distribution with mean zero and variance \({\sigma }^{2}\). The deviation \(\sigma\) depends upon the parameter T, the temperature of the Gaussian network. In addition to the noise, an activation level a i is assigned to every neuron, with the following dynamics:

$$\frac{da_{i}} {dt} = -\frac{a_{i}} {\tau } + \mathcal{I}_{i}$$
(87)

The external state of the \(i\mathrm{th}\) neuron is determined by the following equation:

$$x_{i} = \Psi _{\mathrm{Gss}}(a_{i}) = \frac{1} {2}\left (\tan {\it \text{h}}\left ( \frac{a_{i}} {a_{0}}\right ) + 1\right )$$
(88)

where a 0 is the additional parameter added to the Gaussian network.

4.1.3 Cauchy Machines

This system is an extension of the Gaussian machines, where instead of Gaussian noise, Cauchy noise is added to the system [114, 116]. The reason for replacing the Gaussian noise with Cauchy noise is the observation that Cauchy noise produces both global and local random flights, whereas Gaussian noise produces only local random flights. Thus, the system has a better chance of converging to the global minimum. The probability that a neuron will fire is given as

$$p_{i} = \frac{1} {2} + \frac{1} {\pi } \left (\arctan (\frac{\mathcal{I}_{i}} {T} )\right )$$
(89)

and

$$x_{i} = \Psi _{\mathrm{Cau}}(\mathcal{I}_{i}) = \left \{\begin{array}{ll} 1&\text{with probability }p_{i} \\ 0&\text{with probability }(1 - p_{i}) \end{array} \right.$$
(90)

where T is the temperature of the system, with the same meaning as in the Gaussian or Boltzmann machines. In [63], a comparison of lower bounds for Boltzmann and Cauchy machines is presented.

4.2 Chaotic Extensions

Let \(E_{\mathrm{Hop}}\) represent the general H-T energy function. Chaotic extensions of the H-T model can then be obtained by adding a function to \(E_{\mathrm{Hop}}\) [73, 111], which can be described as

$$E_{\mathrm{CNN}} = E_{\mathrm{Hop}} + H(\mathbf{x},W,\boldsymbol{\theta })$$
(91)

Different definitions for the function H() lead to different chaotic extensions. Some of the chaotic extension methods for the H-T’s model will be presented in the following subsection.

4.2.1 Chen and Aihara’s Model

Chen and Aihara [20, 21] proposed a chaotic addition to the energy function, given as

$$H_{\mathrm{CA}} = \frac{\lambda (t)} {2} \displaystyle\sum _{i}x_{i}(x_{i} - 1)$$
(92)

where \(\lambda (t)\) is the decay parameter. If this term is added to the H-T energy function, then the system’s dynamics are defined by the following equation:

$$\frac{d\mathcal{I}_{i}} {dt} = -\frac{\partial E_{\mathrm{CNN}}} {\partial x_{i}}$$
(93)

The update rule for the internal energy of this model is given as

$$\displaystyle\begin{array}{rcl} \mathcal{I}_{i}(t + 1) = k\mathcal{I}_{i}(t) +\alpha \left (\displaystyle\sum _{j=1,j\neq i}^{n}w_{ ij}x_{j}(t) +\theta _{i}\right ) - z(t)(x_{i}(t) -\theta _{0})& &\end{array}$$
(94)
$$\displaystyle\begin{array}{rcl} z(t + 1) = (1-\beta )z(t)& &\end{array}$$
(95)

where \(\alpha = \nabla t\), \(z(t) =\lambda (t)\nabla t\) is the self-feedback connection weight, \(\beta\) is the damping factor, \(\theta _{0} = 1/2\), and \(k = 1 -\nabla t/\tau\).

Initially, z is very high, creating a strong self-coupling that leads to chaotic dynamics. The chaotic dynamics act like the exploration stage in simulated annealing, searching for a global minimum. In the later stages, z decays so that the system starts to converge to a stable fixed point (similar to the exploitation stage in simulated annealing).
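The following single-neuron sketch of Eqs. (94) and (95) illustrates this exploration-then-settling behavior; the parameter values follow common choices in the chaotic neural network literature, not a specific experiment from the source, and a steep sigmoid stands in for \(\Psi _{c}\).

```python
import numpy as np

k, alpha, beta, theta0, eps = 0.9, 0.015, 0.001, 0.5, 0.004

def psi(I):                        # steep sigmoid standing in for Psi_c
    return 1.0 / (1.0 + np.exp(-I / eps))

I, z = 0.1, 0.08                   # internal state and initial self-feedback weight
x = psi(I)
history = []
for t in range(2_000):
    # Eq. (94) with no recurrent input in this one-neuron demo (w = 0, theta_i = 0)
    I = k * I + alpha * 0.0 - z * (x - theta0)
    z = (1.0 - beta) * z           # Eq. (95): the self-feedback decays
    x = psi(I)
    history.append(x)

# early iterates wander chaotically; once z has decayed the state settles
print(history[:5], history[-1])
```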

4.2.2 Wang and Smith’s Model

Wang and Smith [124] proposed the following chaotic addition to the H-T energy function:

$$H_{\mathrm{WS}} = ({\beta }^{T} - 1)E_{ c}$$
(96)

where E c is the original H-T energy function and \(\beta\) is a parameter which takes values in [0, 1]. If this term is added to the H-T energy function, then the system is defined as

$$\frac{d\mathcal{I}_{i}} {dt} = -\frac{\partial E_{\mathrm{CNN}}} {\partial x_{i}}$$
(97)

The update rule of the internal energy for this model is given as

$$\displaystyle\begin{array}{rcl} \mathcal{I}_{i}(t + 1) = \left (1 -\frac{{\beta }^{T}\nabla t(0)} {\tau } \right )\mathcal{I}_{i}(t) +{ \beta }^{T}\nabla t(0)\left (\displaystyle\sum _{ j=1,j\neq i}^{n}w_{ ij}x_{j}(t) +\theta _{i}\right )& &\end{array}$$
(98)
$$\displaystyle\begin{array}{rcl}{ \beta }^{T}\nabla t(0) = \nabla t(t)& &\end{array}$$
(99)

Initially, \(\nabla t\) is sufficiently large to trigger chaotic search. As \(t \rightarrow \infty \), \(\nabla t\) gradually decreases.

4.2.3 Chaotic Noise Model

Recently, Wang et al. [125] proposed the following chaotic addition to the H-T’s energy function, which is given as

$$H_{\eta } = -\displaystyle\sum _{i}\eta _{i}(t)x_{i}(t)$$
(100)

where \(\eta _{i}\) is the noise term; in general, \(\eta _{i}\) is a normalized time series, and initially each neuron is driven by a chaotic time series. If this term is added to the H-T energy function, then the system is defined as

$$\frac{d\mathcal{I}_{i}} {dt} = -\frac{\partial E_{\mathrm{CNN}}} {\partial x_{i}}$$
(101)

The update rule of the internal state for this model is given as

$$\mathcal{I}_{i}(t + 1) = \left (1 -\frac{\nabla t} {\tau } \right )\mathcal{I}_{i}(t) + \nabla t\left (\displaystyle\sum _{j=1}^{n}w_{ ij}x_{j}(t) +\theta _{i} +\eta _{i}(t)\right )$$
(102)

This model has a linear form of the function H(), compared to the quadratic forms of the earlier models. This may be the reason for its superior optimization performance among the chaotic extension models.

4.3 Convexification

One of the important properties of a convex optimization problem is that a local minimum is the global minimum. Thus, a convex reformulation of the problem and finding a local minimum of the reformulation will give the global minimum. If the actual model is nonconvex, then convexification leads to different approximations (or relaxations), which in turn may be useful in studying various bounds (lower and/or upper) of a given COP. Both penalty-based and Lagrange-based ANNs can be convexified. Convexification of COPs using a penalty function depends upon the type of penalty function and will not be discussed in this subsection; instead, convexification of Lagrange-based ANNs will be discussed. From Theorem 15, it can be seen that the local convexity of Lagrange systems is hard to obtain in practice. However, an augmented Lagrange method can be used to convexify any Lagrange-based ANN. In this subsection, one such convexification method is presented. Let us assume that problem Eq. (74) is nonconvex; then its augmented Lagrangian can be written as

$$L_{\mathrm{Aug}}(\mathbf{x},\boldsymbol{\lambda }) = f(\mathbf{x}) +{ \boldsymbol{\lambda }}^{T}h(\mathbf{x}) + \frac{1} {2}c\vert h(\mathbf{x}){\vert }^{2}$$
(103)

Using Eq. (103), different ANNs can be designed which are known as Augmented Lagrange Neural Networks (ALNN). The state equations can be obtained as

$$\displaystyle\begin{array}{rcl} \frac{dx_{i}} {dt} = -\frac{\partial f} {\partial x_{i}} -\displaystyle\sum _{j=1}^{m}(\lambda _{ j} + ch_{j})\frac{\partial h_{j}} {\partial x_{i}} \quad \forall \;i\, =\, 1,\ldots,n& &\end{array}$$
(104)
$$\displaystyle\begin{array}{rcl} \frac{d\lambda _{j}} {dt} = h_{j},\quad \forall \;j\, =\, 1,\ldots,m& &\end{array}$$
(105)

By selecting a sufficiently large value of c, local convexity can be obtained for any Lagrange neural network under some restrictions (see [13]). However, due to analog circuit and implementation restrictions, the selection of c becomes a critical issue [139].
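The effect of the augmentation can be seen on a deliberately nonconvex toy problem of our own choosing. The following sketch integrates Eqs. (104) and (105) for \(f(x) = -{x}^{2}\) subject to \(h(x) = x - 1 = 0\) (optimum \({x}^{{\ast}} = 1\)); with c = 0 the x-dynamics of this problem diverge, while a sufficiently large c restores local convexity.

```python
dt, c = 1e-3, 10.0       # try c = 0.0 to see the dynamics diverge

x, lam = 0.5, 0.0
for _ in range(100_000):
    dx = 2.0 * x - (lam + c * (x - 1.0))   # Eq. (104): -df/dx - (lam + c*h)*dh/dx
    dlam = x - 1.0                         # Eq. (105): dlam/dt = h(x)
    x, lam = x + dt * dx, lam + dt * dlam

print(x, lam)            # converges toward x* = 1, lambda* = 2 when c = 10
```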

4.4 Hybridization

ANNs have been coupled with metaheuristics like simulated annealing [103, 123–125], tabu search [43], and evolutionary algorithms [102, 126] for solving COPs. The main purpose of these hybridizations is to find the global optimal solution by adding flexibility to the classical H-T model. The usual method of coupling the H-T model with a metaheuristic is to incorporate the constraint handling technique of the ANNs into the global search technique of the metaheuristic.

Moreover, there has been research on hybridizing the H-T model with SONNs (see [108]). One successful implementation is illustrated by Smith et al. [109]. The experimental results indicate that a careful and appropriate hybridization of the H-T model with SONNs results in a better solution methodology, competitive with state-of-the-art optimization solvers.

5 General Optimization Problems

In this section, a discussion on applying ANNs to some general optimization problems will be presented. For the sake of simplicity and convenience, the following general optimization problems will be considered:

  • Linear programming problems

  • Convex programming problems

  • Quadratic programming problems

  • Nonlinear programming problems

  • Complementarity problems

  • Mixed integer programming problem

The objective of including the mixed integer programming problem is to highlight to the reader the method of transformation between discrete and continuous variables when applying ANNs.

5.1 Linear Programming

Linear programming (LP) is the most widely used optimization technique for solving numerous optimization problems. Dantzig [30], the father of linear programming, proposed the famous and widely used simplex method to solve LP problems. However, the simplex method has exponential worst-case complexity. A polynomial time algorithm (the ellipsoid method) for solving LP problems was presented by Khachiyan [68]. Although the ellipsoid method is a polynomial time algorithm, in practice the simplex method outperforms it. Later, in 1984, Karmarkar [66] proposed an algorithm more efficient than the ellipsoid method, which outperforms the simplex method on some complicated problems like scheduling, routing, and planning. Based on the solution methodology, the simplex method is categorized as an exterior point method, whereas Karmarkar's algorithm is categorized as an interior point method. Further studies [122] on the classification of these two fundamentally different ideas led to the conclusion that LP problems are finite in nature, and that the simplex method is a finite algorithm suitable for solving them. However, it was also noted that infinite algorithms like interior point methods may sometimes outperform exterior point algorithms. Thus, there has always been a curiosity among researchers to investigate the infinite algorithms. The ANN approach to solving COPs can be seen as an infinite approach.

From the demonstration of Hopfield and Tank [58], ANNs were seen as an alternative approach to solve the COPs. Many studies have been conducted to solve LP problems using ANNs [25–28, 35, 67, 100, 136]. Most of these models are extensions of the H-T continuous model. As noted in Sect. 2.7, these models have an energy function-based formulation for minimization. Before presenting the models, consider the following notation for the LP problem in canonical LP c and standard LP s form:

$$\displaystyle\begin{array}{rcl} \begin{array}{lll} \mathrm{LP}_{c}\qquad \qquad & \\ \mathrm{minimize} : & \\ &{c}^{T}x\\ \mathrm{subject\;to}:& \\ g_{i}(x) & \geq 0\quad \forall i = 1,\ldots,m\end{array} & & \qquad \qquad \begin{array}{lll} \mathrm{LP}_{s}\qquad \qquad & \\ \mathrm{minimize} : & \\ &\widetilde{{c}}^{T}x \\ \mathrm{subject\;to } : & \\ \widetilde{g}_{i}(x) & = 0\quad \forall i = 1,\ldots,m \end{array} \\ \end{array}$$

The following are the prominent models from the literature.

5.1.1 Kennedy and Chua’s Model

Kennedy and Chua’s [67] model has neurons similar to those of the H-T continuous model. The external states of this model are represented by x. The state update equations and the energy function are given as

$$\displaystyle\begin{array}{rcl} \nabla \mathbf{x}\, =\, {C}^{-1}(-\mathbf{c} -\nabla \mathbf{g(x)}\,\phi (\mathbf{g(x)}))& & \\ E(\mathbf{x(t)})\, =\,{ \mathbf{c}}^{T}\mathbf{x} +\displaystyle\sum _{ j=1}^{m}\displaystyle\int _{ 0}^{g_{j}(\mathbf{x})}\phi _{ j}(s)\,ds& & \\ \end{array}$$

where C is a capacitance matrix, usually taken as the identity I. Kennedy and Chua showed that Tank and Hopfield's model is a special case of the proposed model. Furthermore, they proved that their model converges to a steady state, under the assumption that E is bounded from below. The selection of \(\phi ()\) is similar to that of a penalty function; in [80], it was shown that \(\phi ()\) is indeed an implicit form of the penalty function. The most widely used representation of \(\phi ()\) is \(s\;g_{i}^{-}(\mathbf{x})\), where s > 0 is the penalty weight and \(g_{i}^{-}(\mathbf{x}) = -\mathrm{min}\{g_{i}(\mathbf{x}),0\}\). Based on the idea of applying a penalty function, there have been many extensions of neural networks for LP; some of the prominent ones [100, 138] are presented in the following part of this section.
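A minimal sketch of these dynamics follows; the toy LP, penalty weight, and step size are our own choices, not from the source. It uses the quadratic penalty \(\frac{s} {2}\sum _{j}{[g_{j}^{-}(\mathbf{x})]}^{2}\), whose gradient term pushes violated constraints back toward feasibility, and takes C = I.

```python
import numpy as np

# Toy LP: minimize -x1 - 2*x2  s.t.  1 - x1 - x2 >= 0, x1 >= 0, x2 >= 0;
# the optimum is (0, 1).
c = np.array([-1.0, -2.0])               # objective coefficients
G = np.array([[-1.0, -1.0],              # rows: gradients of g_1, g_2, g_3
              [ 1.0,  0.0],
              [ 0.0,  1.0]])
s, dt = 100.0, 1e-4                      # penalty weight and Euler step

def g(x):
    return np.array([1.0 - x[0] - x[1], x[0], x[1]])

x = np.array([0.5, 0.5])
for _ in range(300_000):
    viol = np.minimum(g(x), 0.0)         # nonzero only for violated constraints
    x += dt * (-c - s * (G.T @ viol))    # dx/dt = -grad E(x)

print(x)   # approaches (0, 1) up to an O(1/s) penalty bias
```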

5.1.2 Rodriquez-Vazquez et al. Model

Rodriquez-Vazquez et al. [100] proposed another model for LP. This model is described as

$$\nabla \mathbf{x}\, =\, -u(\mathbf{x})\mathbf{c} - s(1 - u(\mathbf{x}))\nabla \mathbf{g(x)}\,\mathbf{v(x)}$$

where \(\mathbf{v(x)} = {[v_{1},\ldots v_{m}]}^{T}\) ,

$$\displaystyle\begin{array}{rcl} u(\mathbf{x})& =& \left \{\begin{array}{ll} 1&\text{if }g_{i}(\mathbf{x}) \geq 0\quad \forall j\, =\, 1,\ldots,m,\\ 0 &\text{otherwise} \end{array} \right. \\ v_{j}(\mathbf{x})& =& \left \{\begin{array}{ll} 1&\text{if }g_{j}(\mathbf{x}) \geq 0\\ 0 &\text{otherwise} \end{array} \right. \end{array}$$

and s > 0. This model can be viewed as a dynamic system, which can start from any arbitrary point. Initially, the model moves from an infeasible point to a feasible point; once a feasible point is obtained, the model tries to move to an optimal point. In [45], the convergence analysis of the Rodriquez-Vazquez et al. [100] model was performed, and it was shown that if the problem is convex, the network will converge. However, Zak et al. [138] observed that a discrete time implementation of the Rodriquez-Vazquez et al. model with an adaptive step size may have problems reaching the feasible region. In the following paragraph, the model proposed by Zak et al. is presented.

5.1.3 Zak et al. Model

This model differs from the earlier ANNs for LP, as Zak et al. used the exact (non-differentiable) penalty function to formulate the energy function. This choice of penalty function eliminates the need for a very high penalty weight. Apart from that, all the earlier models were proposed for the canonical representation of LP, whereas this model is applicable to both the standard and the canonical representation. In theory, there is no difference in solving the problem in either representation; but in the case of neural networks, changing from one representation to the other increases the number of neurons (due to slack or artificial variables) and may increase the complexity of the ANN model. The proposed model can be described as

$$\displaystyle\begin{array}{rcl} \nabla \mathbf{x}\, =\, -\nabla \mathbf{E}_{p,\rho,\gamma }(\mathbf{x})& & \\ \nabla \mathbf{E}_{p,\rho,\gamma }(\mathbf{x}) =\widetilde{{ \mathbf{c}}}^{T}\mathbf{x} +\rho \nabla \|\widetilde{\mathbf{g}}(\mathbf{x})\|_{ p} +\gamma \nabla \left (\displaystyle\sum _{i=1}^{m}x_{ i}^{-}\right )& & \\ \end{array}$$

where \(x_{i}^{-} = -\mathrm{min}\{x_{i},0\},\;\rho > 0,\;\gamma > 0,\;1 \leq p \leq \infty \). Although the model uses an exact penalty function, this penalty function is non-differentiable and hence has a few drawbacks. In [137], a convergence analysis of Zak et al.'s model is performed. It was shown that the system trajectory will converge to the solution of the LP under some positivity restrictions on \(\rho,\;\gamma,\;p\); in fact, the equilibrium points of the ANN are solutions to the corresponding LP problem [137].

These are some of the prominent ANN models applied to solve LP problems. In line with the duality theory of LP, there have also been attempts to solve LP using dual ANNs; interested readers may refer to [29, 81].

5.2 Convex Programming

Consider the following convex problem:

$$\displaystyle\begin{array}{rcl} \mathrm{minimize} :& & \\ & f(\mathbf{x}) & \\ \mathrm{subject\;to} :& & \\ g_{i}(\mathbf{x})& = 0\quad \forall i = 1\ldots m& \\ \end{array}$$
$$\displaystyle\begin{array}{rcl} \mathbf{x}& \in \mathbf{X}&\end{array}$$
(106)

where \(f(\mathbf{x}),g_{i}(\mathbf{x}) : {\mathbb{R}}^{n}\mapsto \mathbb{R}\) are convex functions of \(\mathbf{x}\). In order to apply ANNs to solve this problem, it is reformulated into an unconstrained optimization problem using a penalty function. Let \(\Phi ()\) be a penalty function with the following properties:

$$\Phi (\mathbf{x})\left \{\begin{array}{ll} = 0&\text{if x is feasible} \\ > 0 &\text{otherwise} \end{array} \right.$$
(107)

Then, consider the following transformed convex problem:

$$\displaystyle\begin{array}{rcl} \mathrm{minimize} :& & \\ & \mathbf{E}(\mathbf{x},s)& \\ \mathrm{subject\;to} :& & \\ \mathbf{x}& \in \mathbf{X} & \\ \end{array}$$

where \(E(\mathbf{x},s) = f(\mathbf{x}) + s\sum _{i=1}^{m}\Phi (g_{i}(\mathbf{x}))\).

The update equation is

$$\displaystyle\begin{array}{rcl} \nabla \mathbf{x}\, =\, -\eta \,\nabla \mathbf{E}(\mathbf{x},s)& &\end{array}$$
(108)
$$\displaystyle\begin{array}{rcl} \nabla \mathbf{E}(\mathbf{x},s)\, =\, \nabla f(\mathbf{x}) + s\displaystyle\sum _{i=1}^{m}\phi (g_{ i}(\mathbf{x}))\,\nabla g_{i}(\mathbf{x})& & \\ \end{array}$$

The stability and convergence analysis of this model has been performed in [22], where it is shown that the dynamics given by Eq. (108) lead to an equilibrium point.

5.3 Quadratic Programming

Similar to linear programming, quadratic programming problems can be modeled using ANNs. Consider the following quadratic programming problem:

$$\displaystyle\begin{array}{rcl} \mathrm{minimize} :& & \\ &{ \mathbf{x}}^{T}H\mathbf{x} + {c}^{T}\mathbf{x}& \\ \mathrm{subject\;to} :& & \\ A\mathbf{x}& = \mathbf{b} & \\ \end{array}$$
$$\displaystyle\begin{array}{rcl} \mathbf{x}& \in \mathbf{X}&\end{array}$$
(109)

where \(\mathbf{x} \in {\mathbb{R}}^{n}\) and \(A,H \in {\mathbb{R}}^{n\times n}\). This program can be converted to an unconstrained problem using a penalty function method or a Lagrange method. The Lagrange method is presented in the following part of this section. The Lagrangian of problem Eq. (109) is written as

$$L(\mathbf{x},\boldsymbol{\lambda }) ={ \mathbf{x}}^{T}H\mathbf{x} + {c}^{T}\mathbf{x} +{ \boldsymbol{\lambda }}^{T}(A\mathbf{x} -\mathbf{b})$$
(110)

The equations of motion that describe this model are

$$\displaystyle\begin{array}{rcl} \nabla _{t}\mathbf{x} = -\nabla _{x}L(\mathbf{x},\boldsymbol{\lambda })& &\end{array}$$
(111)
$$\displaystyle\begin{array}{rcl} \nabla _{t}\boldsymbol{\lambda } = \nabla _{\lambda }L(\mathbf{x},\boldsymbol{\lambda })& &\end{array}$$
(112)

where

$$\displaystyle\begin{array}{rcl} \nabla _{t}\mathbf{x}& =&{ \left (\frac{dx_{1}} {dt}, \frac{dx_{2}} {dt},\ldots, \frac{dx_{n}} {dt} \right )}^{T},\quad x \in {\mathbb{R}}^{n} \\ \nabla _{t}\boldsymbol{\lambda }& =&{ \left (\frac{d\lambda _{1}} {dt}, \frac{d\lambda _{2}} {dt},\ldots, \frac{d\lambda _{m}} {dt} \right )}^{T},\quad \lambda \in {\mathbb{R}}^{m} \\ \nabla _{x}L(\mathbf{x},\boldsymbol{\lambda })& =&{ \left (\frac{\partial L(\mathbf{x},\boldsymbol{\lambda })} {\partial x_{1}}, \frac{\partial L(\mathbf{x},\boldsymbol{\lambda })} {\partial x_{2}},\ldots, \frac{\partial L(\mathbf{x},\boldsymbol{\lambda })} {\partial x_{n}} \right )}^{T} \\ \end{array}$$

and

$$\displaystyle\begin{array}{rcl} \nabla _{\lambda }L(\mathbf{x},\boldsymbol{\lambda }) ={ \left (\frac{\partial L(\mathbf{x},\boldsymbol{\lambda })} {\partial \lambda _{1}}, \frac{\partial L(\mathbf{x},\boldsymbol{\lambda })} {\partial \lambda _{2}},\ldots, \frac{\partial L(\mathbf{x},\boldsymbol{\lambda })} {\partial \lambda _{m}} \right )}^{T}& & \\ \end{array}$$

The above model is an iterative Lagrangian neural network. Its convergence analysis is presented in Sect. 3 and in [139]. Apart from that, quadratic programming problems with ANNs have been exhaustively studied in [39, 80, 134, 135].
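As a concrete illustration, the following sketch (toy data ours) integrates Eqs. (111) and (112) by explicit Euler for a small QP with H = I; the network settles at the KKT point of the problem.

```python
import numpy as np

# Toy QP: minimize x^T H x + c^T x  s.t.  A x = b, with H = I,
# c = (-2, -2), A = [1 1], b = 1; the optimum is x* = (0.5, 0.5).
H = np.eye(2)
c = np.array([-2.0, -2.0])
A = np.array([[1.0, 1.0]])
b = np.array([1.0])

dt = 1e-3
x, lam = np.zeros(2), np.zeros(1)
for _ in range(50_000):
    dx = -(2.0 * H @ x + c + A.T @ lam)   # Eq. (111): descent in x
    dlam = A @ x - b                      # Eq. (112): ascent in the multiplier
    x, lam = x + dt * dx, lam + dt * dlam

print(x, lam)   # converges toward x* = (0.5, 0.5), lambda* = 1
```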

5.4 Nonlinear Programming

Similar to the quadratic and convex programming cases, general nonlinear programming problems can be solved using ANNs. One way is to use a penalty method approach [67]. Another approach is based on the Lagrange method, presented by Zhang and Constantinides [139]. The main differences from the penalty function method and the advantages of the Lagrange methods are discussed in Sect. 3 of this chapter. In the following part of this subsection, the method of writing a Lagrangian for a generalized nonlinear problem is presented.

$$\displaystyle\begin{array}{rcl} \mathrm{minimize} :& & \\ & f(\mathbf{x}) & \\ \mathrm{subject\;to} :& & \\ h_{i}(\mathbf{x})& = 0\quad \forall i = 1\ldots m& \\ \end{array}$$
$$\displaystyle\begin{array}{rcl} \mathbf{x}& \in {\mathbb{R}}^{n}&\end{array}$$
(113)

where \(f(\mathbf{x}),h_{i}(\mathbf{x}) : {\mathbb{R}}^{n}\mapsto \mathbb{R}\) are any continuous and twice differentiable functions of \(\mathbf{x}\). The Lagrange of the above COP is written as

$$L(\mathbf{x},\boldsymbol{\lambda }) = f(\mathbf{x}) +{ \boldsymbol{\lambda }}^{T}h(\mathbf{x})$$
(114)

The equations of motion that describes this model are

$$\displaystyle\begin{array}{rcl} \nabla _{t}\mathbf{x} = -\nabla _{x}L(\mathbf{x},\boldsymbol{\lambda })& &\end{array}$$
(115)
$$\displaystyle\begin{array}{rcl} \nabla _{t}\boldsymbol{\lambda } = \nabla _{\lambda }L(\mathbf{x},\boldsymbol{\lambda })& &\end{array}$$
(116)

where

$$\displaystyle\begin{array}{rcl} \nabla _{t}\mathbf{x} ={ \left (\frac{dx_{1}} {dt}, \frac{dx_{2}} {dt},\ldots, \frac{dx_{n}} {dt} \right )}^{T},x \in {\mathbb{R}}^{n}& & \\ \nabla _{t}\boldsymbol{\lambda } ={ \left (\frac{d\lambda _{1}} {dt}, \frac{d\lambda _{2}} {dt},\ldots, \frac{d\lambda _{m}} {dt} \right )}^{T},\lambda \in {\mathbb{R}}^{m}& & \\ \nabla _{x}L(\mathbf{x},\boldsymbol{\lambda }) ={ \left (\frac{\partial L(\mathbf{x},\boldsymbol{\lambda })} {\partial x_{1}}, \frac{\partial L(\mathbf{x},\boldsymbol{\lambda })} {\partial x_{2}},\ldots, \frac{\partial L(\mathbf{x},\boldsymbol{\lambda })} {\partial x_{n}} \right )}^{T}& & \\ \end{array}$$

and

$$\nabla _{\lambda }L(\mathbf{x},\boldsymbol{\lambda }) ={ \left (\frac{\partial L(\mathbf{x},\boldsymbol{\lambda })} {\partial \lambda _{1}}, \frac{\partial L(\mathbf{x},\boldsymbol{\lambda })} {\partial \lambda _{2}},\ldots, \frac{\partial L(\mathbf{x},\boldsymbol{\lambda })} {\partial \lambda _{m}} \right )}^{T}$$

The above model is iterative, similar to the H-T continuous model. Both types of neurons represent decision variables: the Lagrangian variables drive the solution toward the feasible space, whereas the ordinary variables drive the solution toward an optimal value. The mathematical properties and convergence proofs of this model were discussed in Sect. 3.

5.5 Complementarity Problem

5.5.1 Linear Complementarity Problem (LCP)

Given a matrix \(M \in {\mathbb{R}}^{n\times n}\) and a vector \(\mathbf{q} \in {\mathbb{R}}^{n}\), the LCP can be stated as:

$$\displaystyle\begin{array}{rcl} \mathrm{find} :& & \\ & \mathbf{x} & \\ \mathrm{subject\;to} :& & \\ M\mathbf{x} + \mathbf{q}& \geq 0& \\ {x}^{T}(M\mathbf{x} + \mathbf{q})& = 0& \\ \end{array}$$
$$\displaystyle\begin{array}{rcl} \mathbf{x}& \geq 0&\end{array}$$
(117)

where \(x \in {\mathbb{R}}^{n}\). This problem can be solved by ANNs by converting it to a nonlinear problem using the Fischer function. Let \(\phi (a,b)\) be the Fischer function, defined as \(\phi (a,b) = \sqrt{{a}^{2 } + {b}^{2}} - a - b\). The purpose of using this function is the following important property:

$$\phi (a,b) = 0 \Leftrightarrow a \geq 0,\;b \geq 0\;\text{and}\;ab = 0$$
(118)

This property is exploited while solving the LCP using an ANN, as it mimics the LCP conditions:

$$\phi _{i}(x_{i},F_{i}(\mathbf{x})) = 0\quad \forall \;i$$
(119)

where \(F_{i}(\mathbf{x}) = (M\mathbf{x} + \mathbf{q})_{i}\). Let \(\Phi (\mathbf{x})\) represent the vector whose \(i\mathrm{th}\) element is \(\phi _{i}(x_{i},F_{i}(\mathbf{x}))\). Then, solving the LCP is the same as solving the following problem:

$$\displaystyle\begin{array}{rcl} \mathrm{minimize} :& & \\ & F(\mathbf{x})&\end{array}$$
(120)

where

$$F(\mathbf{x}) = \frac{1} {2}{\vert \vert \Phi (\mathbf{x})\vert \vert }^{2}$$
(121)

Now, the above problem can be solved by ANNs as a quadratic programming problem, using either an H-T model or a Lagrange model.
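The following sketch (toy data ours) carries out this minimization with plain gradient descent on \(F(\mathbf{x}) = \frac{1} {2}{\vert \vert \Phi (\mathbf{x})\vert \vert }^{2}\), standing in for a full ANN simulation; the small constant added to the square root is only a guard against the kink of the Fischer function at the origin.

```python
import numpy as np

M = np.array([[2.0, 0.0], [0.0, 2.0]])
q = np.array([-2.0, 1.0])                  # LCP solution: x* = (1, 0)

def Phi(x):
    a, b = x, M @ x + q
    return np.sqrt(a * a + b * b) - a - b  # Fischer function, componentwise

def grad_F(x):
    a, b = x, M @ x + q
    r = np.sqrt(a * a + b * b) + 1e-12     # guard against 0/0 at the kink
    J = np.diag(a / r - 1.0) + (b / r - 1.0)[:, None] * M   # Jacobian of Phi
    return J.T @ Phi(x)

x = np.array([0.5, 0.5])
for _ in range(50_000):
    x -= 0.02 * grad_F(x)

print(x, Phi(x))   # x approaches (1, 0) and Phi(x) approaches 0
```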

5.5.2 Nonlinear Complementarity Problem (NCP)

Given a set of functions \(F_{i} : {\mathbb{R}}^{n}\mapsto \mathbb{R},\quad \forall \;i\, =\, 1,\ldots,n\), the NCP can be stated as

$$\displaystyle\begin{array}{rcl} \mathrm{find} :& & \\ & \mathbf{x} & \\ \mathrm{subject\;to} :& & \\ F_{i}(\mathbf{x})& \geq 0\quad \forall \;i& \\ x_{i}F_{i}(\mathbf{x})& = 0\quad \forall \;i& \\ \end{array}$$
$$\displaystyle\begin{array}{rcl} \mathbf{x}& \geq 0&\end{array}$$
(122)

where \(x \in {\mathbb{R}}^{n}\) and each F i is assumed to be a continuously differentiable function. ANNs can be used to solve the NCP using the idea of the Fischer function [75]. Similar to the LCP case, define a vector \(\Phi (\mathbf{x})\), whose \(i\mathrm{th}\) element is

$$\phi _{i}(x_{i},F_{i}(\mathbf{x}))\quad \forall \;i$$
(123)

Thus, solving the NCP is the same as solving the following problem:

$$\displaystyle\begin{array}{rcl} \mathrm{minimize} :& & \\ & E(\mathbf{x})&\end{array}$$
(124)

where

$$E(\mathbf{x}) = \frac{1} {2}{\vert \vert \Phi (\mathbf{x})\vert \vert }^{2}$$
(125)

Now, the above problem can be solved by ANNs using a Lagrange model.

5.6 Mixed Integer Programming Problems (MIPs)

The difficulty in solving an optimization problem is not due to its nonlinear (or discrete) representation. The criterion that separates difficult problems from easy ones is convexity: a convex problem in a linear or nonlinear representation is tractable, whereas a nonconvex problem in any representation is difficult to solve. By presenting the example of a general MIP, we would like to highlight the thin line that separates a continuous optimization problem from a discrete optimization problem when applying ANNs.

Consider the following MIP, defined as

$$\displaystyle\begin{array}{rcl} \mathrm{minimize} :& & \\ & f(\mathbf{x,y}) & \\ \mathrm{subject\;to} :& & \\ g_{i}(\mathbf{x,y})& \leq 0\quad \forall i = 1,\ldots,m & \\ h_{j}(\mathbf{x,y})& = 0\quad \forall j = 1,\ldots,p & \\ l_{k} \leq x_{k}& \leq u_{k}\quad \forall k = 1,\ldots,n& \\ \end{array}$$
$$\displaystyle\begin{array}{rcl} y_{r}\; \in \{ 0&,1\}\quad \forall r = 1,\ldots,q&\end{array}$$
(126)

Similar to the previous approaches, this problem can be converted to an unconstrained optimization problem using either a penalty function or a Lagrange function. For the sake of completeness, both approaches are illustrated in the following parts of this subsection.

5.6.1 Penalty Function Approach

The modified (penalized) energy function for problem Eq. (126) can be written as

$$\displaystyle\begin{array}{rcl} E_{\text{mip}}(\mathbf{x,y})& =& C_{1}\,f(\mathbf{x,y}) + C_{2}\displaystyle\sum _{i=1}^{m}\Phi [g_{ i}(\mathbf{x,y})] \\ \end{array}$$
$$\displaystyle\begin{array}{rcl} & & +C_{3}\displaystyle\sum _{j=1}^{p}h_{ j}^{2}(\mathbf{x,y}) + C_{ 4}\displaystyle\sum _{r=1}^{q}y_{ r}(1 - y_{r})\end{array}$$
(127)

where \(C_{1},\;C_{2},\;C_{3},\) and C 4 are scaling constants used as the penalty parameters, and \(\Phi \) is the penalty function. The dynamics of the network defined by Eq. (127) can be described as

$$\displaystyle\begin{array}{rcl} \frac{d\mathcal{I}_{k}^{x}} {dt} & =& -\eta _{x}\frac{\partial E_{\mathrm{mip}}} {\partial x_{k}} \quad \forall k = 1,\ldots,n\end{array}$$
(128)
$$\displaystyle\begin{array}{rcl} \frac{d\mathcal{I}_{r}^{y}} {dt} & =& -\eta _{y}\frac{\partial E_{\mathrm{mip}}} {\partial y_{r}} \quad \forall r = 1,\ldots,q\end{array}$$
(129)
$$\displaystyle\begin{array}{rcl} x_{k}& =& \Psi _{c}(\mathcal{I}_{k}^{x})\quad \forall k = 1,\ldots,n\end{array}$$
(130)
$$\displaystyle\begin{array}{rcl} y_{r}& =& \Psi _{d}(\mathcal{I}_{r}^{y})\quad \forall r = 1,\ldots,q\end{array}$$
(131)

where \(\eta _{x}\) and \(\eta _{y}\) are positive scaling coefficients for the dynamics of the system. In addition, \(\Psi _{d}\) can be replaced by a tuned \(\Psi _{c}\) that yields discrete values of y r .
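The following tiny sketch (the problem and all parameter values are ours) illustrates the penalty dynamics of Eqs. (127)–(131) on a problem with one continuous and one binary variable. The term \(C_{4}\,y(1 - y)\), together with a steep sigmoid standing in for the tuned \(\Psi _{c}\), drives y to a binary value; since these are gradient dynamics, only local convergence is expected, so the initial point matters.

```python
import numpy as np

# Toy problem: minimize (x - 2)^2 + (y - x)^2 with y in {0, 1};
# the global optimum is y = 1, x = 1.5.
C4, dt, eps = 5.0, 1e-3, 0.05

def psi(I):                   # steep sigmoid standing in for the tuned Psi_c
    return 1.0 / (1.0 + np.exp(-I / eps))

Ix, Iy = 2.0, 0.0             # internal states (x started near 2)
x, y = Ix, psi(Iy)
for _ in range(100_000):
    dEdx = 2.0 * (x - 2.0) - 2.0 * (y - x)
    dEdy = 2.0 * (y - x) + C4 * (1.0 - 2.0 * y)   # includes the binary penalty
    Ix -= dt * dEdx           # Eq. (128) with eta_x = 1
    Iy -= dt * dEdy           # Eq. (129) with eta_y = 1
    x = Ix                    # identity transfer for the continuous neuron
    y = psi(Iy)

print(x, y)                   # x near 1.5, y saturated near 1
```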

5.6.2 Lagrange Function Approach

The Lagrange of problem Eq. (126) can be written as

$$\displaystyle\begin{array}{rcl} L_{\mathrm{mip}}(x,y,\mu {,\nu }^{1}{,\nu }^{2})& =& f(x,y) +\displaystyle\sum _{ i=1}^{m}\mu _{ i}g_{i}(x,y) \\ & & +\displaystyle\sum _{j=1}^{p}\nu _{ j}^{1}h_{ j}(x,y) +\displaystyle\sum _{ r=1}^{q}\nu _{ r}^{2}y_{ r}(1 - y_{r})\end{array}$$
(132)

where \(\mu _{i} \geq 0\quad \forall \;i\; =\; 1\ldots m\) and \(\nu _{j}^{1},\,\nu _{r}^{2} \in \; \mathbb{R}\quad \forall \;j\; =\; 1\ldots p\;\) and \(\;\forall \;r\; =\; 1\ldots q\). Similar to the penalty function approach, the system dynamics can be described as

$$\displaystyle\begin{array}{rcl} \nabla _{t}\mathbf{x} = -\nabla _{x}L_{\mathrm{mip}}(\mathbf{x,y,\boldsymbol{\mu {,\nu }^{1}{,\nu }^{2}}})& &\end{array}$$
(133)
$$\displaystyle\begin{array}{rcl} \nabla _{t}\mathbf{y} = -\nabla _{y}L_{\mathrm{mip}}(\mathbf{x,y,\boldsymbol{\mu {,\nu }^{1}{,\nu }^{2}}})& &\end{array}$$
(134)
$$\displaystyle\begin{array}{rcl} \nabla {_{t}\nu }^{1} = \nabla _{{\nu }^{ 1}}L_{\mathrm{mip}}(\mathbf{x,y,\boldsymbol{\mu {,\nu }^{1}{,\nu }^{2}}})& &\end{array}$$
(135)
$$\displaystyle\begin{array}{rcl} \nabla {_{t}\nu }^{2} = \nabla _{{\nu }^{ 2}}L_{\mathrm{mip}}(\mathbf{x,y,\boldsymbol{\mu {,\nu }^{1}{,\nu }^{2}}})& &\end{array}$$
(136)
$$\displaystyle\begin{array}{rcl} \frac{d\mu _{i}} {dt} = \left \{\begin{array}{ll} 0 & \text{for} \mu_{i} = 0\;\text{\&}\;\frac{\partial L_{\mathrm{mip}}(\boldsymbol{x,y,\mu {,\nu }^{1}{,\nu }^{2}})} {\partial \mu _{i}} < 0 \\ \frac{\partial L_{\mathrm{mip}}(\boldsymbol{x,y,\mu {,\nu }^{1}{,\nu }^{2}})} {\partial \mu _{i}} & \mathrm{otherwise} \end{array} \right.\quad \forall i& &\end{array}$$
(137)

In [127], the penalty-based approach is applied to solve the factory allocation problem and the unit commitment problem. The direct use of inequality constraints in Lagrange programming ANNs has been introduced in [61, 79]. Eq. (137) illustrates a simple update method for the inequality constraints, from [79]. A primitive way to deal with inequality constraints is to transform them into equality constraints; another approach is to use the square of the Lagrange multiplier for the inequality constraints [61].

From this section, it should be clear that ANNs are not limited to solving linear or quadratic programming problems. With the development of penalty- and Lagrange-based methods, ANNs can be applied to any typical optimization problem, and with the ease of transformation between continuous and discrete variable spaces, they can be appropriately designed for any COP. In the following section, some of the applications of ANNs in solving discrete optimization problems are surveyed.

6 Discrete Optimization Problems

Discrete optimization problems find application in problems related to transportation, scheduling, sorting, networking, etc. In this section, various methods of mapping discrete optimization problems onto ANNs will be presented.

6.1 Graph Problems

Graph problems are among the most extensively studied and applied problems in operations research. The goal of this subsection is to discuss these problems and their mapping onto ANNs [96].

6.1.1 Graph Partitioning Problem (GPP)

Problem Statement: Given a graph G = (V, E), where V represents the vertices, | V | = n, and E represents the edges, | E | = m. Let w i be the weight on vertex i, and let e ij be the weight on the link connecting the \(i\mathrm{th}\) and \(j\mathrm{th}\) vertices. The problem is to partition the graph into two partitions of nearly equal weight such that the cut-size is minimized. In other words, the problem is to bisect the graph such that the sums of the vertex weights assigned to the two partitions are nearly equal, while the sum of the weights on the links connecting the partitions is minimized.

Solution Methodology: A continuous H-T model was used to solve this problem [9]. Let x i represent the external state of a neuron, which can take any value between 0 and 1. All the neurons with value 0 are assigned to the first partition; the other neurons, with value 1, are assigned to the second partition. The objective function for GPP is given as

$$\displaystyle\begin{array}{rcl} E_{\mathrm{GPP}}& =& \frac{1} {2}\displaystyle\sum _{i=1}^{n}\displaystyle\sum _{ j=1}^{n}e_{ ij}(x_{i} + x_{j} - 2x_{i}x_{j}) +\displaystyle\sum _{ i=1}^{n}\displaystyle\sum _{ j=1}^{n}w_{ i}w_{j}(1 - x_{i} - x_{j} + 2x_{i}x_{j}) \\ & & \end{array}$$
(138)

The first term of the objective function aims at minimizing the weighted sum of the edges which belong to the cut, whereas the second term of the objective function aims at balancing the weights in both the partitions. When the \(E_{\mathrm{GPP}}\) is compared with E c , the following values for the synaptic weights and the input bias are obtained:

$$W_{ij} = 2e_{ij} - 4w_{i}w_{j}\quad \text{and}\quad \theta _{i} = -\displaystyle\sum _{j=1}^{n}e_{ ij} + 2w_{i}\displaystyle\sum _{j=1}^{n}w_{ j}$$
(139)

Thus, the H-T model is complete, and the above system is solved to get the solution of the GPP.
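The mapping of Eq. (139) can be exercised on a small random graph, as in the following sketch (the data and the update scheme are ours). Rather than committing to a sign convention for the neuron inputs, each consecutive update simply picks the binary state that lowers the energy \(E = -\frac{1} {2}{\mathbf{x}}^{T}W\mathbf{x} +{ \boldsymbol{\theta }}^{T}\mathbf{x}\), in the spirit of Theorem 3.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 8
e = np.triu(rng.integers(0, 2, (n, n)).astype(float), 1)
e = e + e.T                                 # symmetric edge weights, zero diagonal
w = np.ones(n)                              # unit vertex weights

W = 2.0 * e - 4.0 * np.outer(w, w)          # synaptic weights, Eq. (139)
theta = -e.sum(axis=1) + 2.0 * w * w.sum()  # input bias, Eq. (139)

def energy(x):
    return -0.5 * x @ W @ x + theta @ x

x = rng.integers(0, 2, n).astype(float)
for _ in range(20):                         # sweeps of consecutive updates
    for i in range(n):
        for v in (0.0, 1.0):
            y = x.copy()
            y[i] = v
            if energy(y) < energy(x):
                x = y

print(x, energy(x))   # x in {0,1}^n encodes the two partitions
```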

6.1.2 Graph K-Partitioning Problem (GKP)

Problem Statement: Given a graph G = (V, E), where V represents the vertices, | V | = n, and E represents the edges, | E | = m. Let w i be the weight on vertex i, and let e ij be the weight on the link connecting the \(i\mathrm{th}\) and \(j\mathrm{th}\) vertices. This problem is a generalized version of GPP, where the objective is to partition a graph with the same description as in GPP into K partitions. In general, K is not a multiple of 2.

Solution Methodology: This problem can be formulated as an assignment problem, where each node is assigned to exactly one partition. Since there are K partitions and n nodes, there will be N = nK variables. Since, in the H-T model, each variable represents a neuron, there will be altogether N neurons. Let x ij be the external state of the corresponding neuron, which can take values between 0 and 1.

The objective function with penalty parameters is written as

$$\displaystyle\begin{array}{rcl} E_{\mathrm{GKP}}& =& \frac{1} {2}\displaystyle\sum _{i=1}^{n}\displaystyle\sum _{ j=1}^{K}\displaystyle\sum _{ k=1,k\neq j}^{K}x_{ ij}x_{ik} + \frac{1} {2}{\left (\displaystyle\sum _{i=1}^{n}\displaystyle\sum _{ j=1}^{K}x_{ ij} - n\right )}^{2} \\ & & +\frac{1} {2}\displaystyle\sum _{i,j=1}^{n}\displaystyle\sum _{ k,l=1}^{K}e_{ ij}(x_{ik} + x_{jl} - 2x_{ik}x_{jl}) \\ & & +\displaystyle\sum _{k=1}^{K}\displaystyle\sum _{ i=1}^{n}\displaystyle\sum _{ j=1}^{n}x_{ ik}x_{jk}w_{i}w_{j} \end{array}$$
(140)

When the \(E_{\mathrm{GKP}}\) is compared with E d , the following values for the weights and input bias are obtained:

$$W_{ij,kl} = -\delta _{ik}(1 -\delta _{jl}) - 1 + 2e_{ik} - 2w_{i}w_{k}\delta _{jl}\quad \mathrm{and}\quad \theta _{ij} = n -\displaystyle\sum _{k=1}^{n}e_{ik}$$
(141)

where W ij, kl is the synaptic weight between neuron ij Footnote 13 and neuron kl. Thus, with the above mapping, the H-T’s model is complete. The above system is solved by using Algorithm 1 to get the solution of GKP. A higher-order ANN for solving GKP was presented in [40].

6.1.3 Vertex Cover Problem (VCP)

Problem Definition: Given a graph G = (V, E), where V represents the vertices, | V | = n and E represents the edges, | E | = m. Let A be the adjacency matrix, where a ij = 1 if there is an edge between the \(i\mathrm{th}\) and \(j\mathrm{th}\) vertices, otherwise \(a_{ij} = 0\). A subset C of the vertices V of the graph G is called a vertex cover if every edge of G is incident to at least one vertex in the set C. The VCP is to find a vertex cover C of minimum cardinality.

Solution Methodology: Let

$$x_{i} = \left \{\begin{array}{ll} 1&\text{if vertex }i \in C\\ 0 &\text{otherwise} \end{array} \right.$$
(142)

Since there are | V | = n vertices in G, there will be n neurons used to design the model. The modified objective function designed for H-T’s model will be

$$E_{\mathrm{VCP}} = C_{1}\left (\displaystyle\sum _{i=1}^{n}\displaystyle\sum _{j=1}^{n}x_{i}x_{j}\right ) + \frac{C_{2}}{2} \left (\displaystyle\sum _{i=1}^{n}\displaystyle\sum _{j=1}^{n}(1 - x_{i})(1 - x_{j})a_{ij}\right )$$
(143)

where C 1 and C 2 are penalty parameters whose values decide the relative importance of the constraints and the objective function. The first term in \(E_{\mathrm{VCP}}\) encourages a cover of minimum cardinality. The second term (the penalty term) is zero if and only if no edge connects two nodes that are both outside the vertex cover, i.e., if C is a valid cover. By comparing the coefficients of \(E_{\mathrm{VCP}}\) and E d , the synaptic weights and the input bias are obtained as

$$W_{ij} = -2C_{1} - C_{2}a_{ij}\quad \text{and}\quad \theta _{i} = C_{2}\displaystyle\sum _{j=1}^{n}a_{ij}$$
(144)

Thus, with the above mapping, the H-T’s model is complete. The above system is solved by using Algorithm 1 to get the solution of VCP.
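A minimal sketch of this mapping, under the same assumptions as the GPP sketch above (random instance, step-function updates, diagonal self-terms folded into the bias); the values C1 = 1 and C2 = 4 are arbitrary choices that weight feasibility over minimality.

```python
import numpy as np

rng = np.random.default_rng(1)

# Random graph as a symmetric 0/1 adjacency matrix.
n = 10
a = (rng.random((n, n)) < 0.3).astype(float)
a = np.triu(a, 1); a = a + a.T

# Mapping of Eq. (144).
C1, C2 = 1.0, 4.0
W = -2 * C1 - C2 * a
theta = C2 * a.sum(axis=1) + 0.5 * np.diag(W)  # fold self-terms into the bias
np.fill_diagonal(W, 0.0)

x = rng.integers(0, 2, n).astype(float)
for _ in range(100):
    changed = False
    for i in rng.permutation(n):
        xi = 1.0 if W[i] @ x + theta[i] > 0 else 0.0
        changed |= xi != x[i]
        x[i] = xi
    if not changed:
        break

# A valid cover leaves no edge with both endpoints outside the set.
uncovered = ((1 - x)[:, None] * (1 - x)[None, :] * a).sum() / 2
print("cover size:", int(x.sum()), " uncovered edges:", int(uncovered))
```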

6.1.4 Maximum Independent Set Problem (MISP)

Problem Definition: Given a graph G = (V, E), where V represents the vertices, | V | = n and E represents the edges, | E | = m. Let A represent the adjacency matrix, where a ij = 1 if there is an edge between \(i\mathrm{th}\) and \(j\mathrm{th}\) vertices, otherwise a ij = 0. A subset I of the vertices V of the graph G is called the independent set if there is no edge between any pair of vertices which belong to I. The MISP is to find an independent set with the maximum cardinality.

Solution Methodology: Let

$$x_{i} = \left \{\begin{array}{ll} 1&\text{if vertex }i \in I\\ 0 &\text{otherwise} \end{array} \right.$$
(145)

Since there are | V | = n vertices in G, there will be n neurons used to design the model. The modified objective function designed for H-T’s model will be

$$E_{\mathrm{MISP}} = -C_{1}\left (\displaystyle\sum _{i=1}^{n}\displaystyle\sum _{j=1}^{n}x_{i}x_{j}\right ) + \frac{C_{2}}{2} \left (\displaystyle\sum _{i=1}^{n}\displaystyle\sum _{j=1}^{n}x_{i}x_{j}a_{ij}\right )$$
(146)

where C 1 and C 2 are penalty parameters whose values decide the relative importance of the constraints and the objective function. The first term in \(E_{\mathrm{MISP}}\) maximizes the cardinality of the set I (note its negative sign). The second term ensures that the set I forms an independent set. By comparing the coefficients of \(E_{\mathrm{MISP}}\) and E d , the synaptic weights and the input bias are obtained as

$$W_{ij} = 2C_{1} - C_{2}a_{ij}\quad \mathrm{and}\quad \theta _{i} = 0$$
(147)

Thus, with the above mapping, the H-T’s model is complete. The above system is solved by using Algorithm 1 to get the solution of MISP.

6.1.5 Maximum Clique Set Problem (MCSP)

Problem Definition: Given a graph G = (V, E), where V represents the vertices, | V | = n and E represents the edges, | E | = m. Let A represent the adjacency matrix, where a ij = 1 if there is an edge between the \(i\mathrm{th}\) and \(j\mathrm{th}\) vertices, otherwise a ij = 0. A subset of the vertices V of the graph G is called a clique if every pair of vertices in the subset is connected by an edge. The MCSP is to find a clique with the maximum cardinality.

Solution Methodology: Let

$$x_{i} = \left \{\begin{array}{ll} 1&\text{if vertex }i\text{ is in the clique}\\ 0&\text{otherwise} \end{array} \right.$$
(148)

Since there are | V | = n vertices in G, there will be n neurons used to design the model. Let \(a_{ij}^{c} = 1 - a_{ij}\); with this modification, the MCSP becomes the MISP on the complement graph. The modified objective function designed for H-T’s model will be

$$E_{\mathrm{MCSP}} = -C_{1}\left (\displaystyle\sum _{i=1}^{n}\displaystyle\sum _{j=1}^{n}x_{i}x_{j}\right ) + \frac{C_{2}}{2} \left (\displaystyle\sum _{i=1}^{n}\displaystyle\sum _{j=1}^{n}x_{i}x_{j}a_{ij}^{c}\right )$$
(149)

where C 1 and C 2 are penalty parameters whose values decide the relative importance of the constraints and the objective function. The first term in \(E_{\mathrm{MCSP}}\) maximizes the cardinality of the clique. The second term ensures that the selected set forms a clique. By comparing the coefficients of \(E_{\mathrm{MCSP}}\) and E d , the synaptic weights and the input bias are obtained as

$$W_{ij} = 2C_{1} - C_{2}a_{ij}^{c}\quad \mathrm{and}\quad \theta _{ i} = 0$$
(150)

Thus, with the above mapping, the H-T’s model is complete. The above system is solved by using Algorithm 1 to get the solution of MCSP.
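The complement-graph reduction is easy to express in code. The helper below is a hypothetical sketch: it builds the weights of Eq. (150) by reusing the MISP form of Eq. (147) on \(a^{c}\); the penalty values and the example graph are our assumptions.

```python
import numpy as np

def mcsp_weights(a, C1=1.0, C2=4.0):
    """Weights and biases of Eq. (150) from a 0/1 adjacency matrix a."""
    a_c = 1.0 - a
    np.fill_diagonal(a_c, 0.0)      # no self-loops in the complement graph
    W = 2 * C1 - C2 * a_c           # MISP mapping of Eq. (147), applied to a_c
    theta = 0.5 * np.diag(W)        # fold self-terms (x_i^2 = x_i) into the
    np.fill_diagonal(W, 0.0)        # bias; Eq. (150) lists theta_i = 0
    return W, theta

# Example: a 4-cycle; its maximum cliques are its edges (size 2).
a = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], float)
W, theta = mcsp_weights(a)
print(W, theta, sep="\n")
```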

6.1.6 Graph Coloring Problem (GCP)

Problem Definition: Given a graph G = (V, E), where V represents the vertices, | V | = n and E represents the edges, | E | = m. Let A represent the adjacency matrix, where a ij = 1 if there is an edge between \(i\mathrm{th}\) and \(j\mathrm{th}\) vertices, otherwise a ij = 0. The GCP is to find the minimum number of subsets of V such that no two adjacent vertices (two vertices connected by an edge) belong to the same subset. In other words, find the minimum number of different colors needed to color the graph such that no two adjacent vertices are of the same color.

Solution Methodology: Let

$$x_{ij} = \left \{\begin{array}{ll} 1&\text{if vertex }i\text{ is colored with color }j\\ 0&\text{otherwise} \end{array} \right.$$
(151)

Since there are | V | = n vertices in G, it follows from graph theory that at most one plus the maximum vertex degree colors are needed to color the graph (the greedy bound). Let \(\nu\) represent this maximum number of colors. Thus, there will be \(n\nu\) neurons used to design the model. The modified objective function designed for H-T’s model will be

$$E_{\mathrm{GCP}} = \frac{1}{2}\left (\displaystyle\sum _{k=1}^{\nu }\displaystyle\sum _{i=1}^{n}\displaystyle\sum _{j=1}^{n}x_{ik}x_{jk}a_{ij}\right ) -\left (\displaystyle\sum _{k=1}^{\nu }\displaystyle\sum _{i=1}^{n}\displaystyle\sum _{j=1}^{n}x_{ik}x_{jk}\right )$$
(152)

The first term penalizes pairs of adjacent vertices that receive the same color, while the second term rewards grouping the vertices into few large color classes. By comparing the coefficients of \(E_{\mathrm{GCP}}\) and E d , the synaptic weights and the input bias are obtained as

$$W_{ij} = 1 - a_{ij}/2\quad \mathrm{and}\quad \theta _{i} = 0$$
(153)

Thus, with the above mapping, the H-T’s model is complete. The above system is solved by using Algorithm 1 to get the solution of GCP.
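As a sketch (our construction, not from the source), the GCP network can be laid out as an n × ν grid of neurons carrying the same-color weights of Eq. (153); only same-color neuron pairs interact, since both terms of Eq. (152) couple neurons sharing the color index. The tensor layout and the example graph are assumptions.

```python
import numpy as np

def gcp_network(a):
    """Neuron grid and weights for the GCP mapping of Eq. (153), as a sketch.

    One neuron per (vertex, color) pair; nu = max degree + 1 colors always
    suffice by the greedy bound, so n * nu neurons are allocated.
    """
    n = a.shape[0]
    nu = int(a.sum(axis=1).max()) + 1          # color budget: Delta + 1
    # W[i, k, j, l]: weight between neuron (i, k) and neuron (j, l).
    W = np.zeros((n, nu, n, nu))
    for k in range(nu):                        # only same-color pairs interact
        W[:, k, :, k] = 1.0 - a / 2.0
    theta = np.zeros((n, nu))
    return W, theta, nu

# Triangle graph: needs exactly 3 colors; nu = 2 + 1 = 3.
a = np.array([[0, 1, 1], [1, 0, 1], [1, 1, 0]], float)
W, theta, nu = gcp_network(a)
print("color budget:", nu, " neurons:", theta.size)
```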

6.1.7 Graph Matching Problem (GMP)

Problem Definition: Given a graph G = (V, E), where V represents the vertices, | V | = n and E represents the edges, | E | = m. Let B be the incidence matrix, where b ij = 1 if the \(j\mathrm{th}\) edge is incident on the \(i\mathrm{th}\) vertex, otherwise b ij = 0. A subset M of the edges E of the graph G is called a matching if no two edges in M share a common node. The GMP is to find a matching with the maximum cardinality.

Solution Methodology: Let

$$x_{i} = \left \{\begin{array}{ll} 1&\text{if edge }i \in M\\ 0 &\text{otherwise} \end{array} \right.$$
(154)

Since there are | E | = m edges in G, there will be m neurons used to design the model. The modified objective function designed for H-T’s model will be

$$E_{\mathrm{GMP}} = -\left (\displaystyle\sum _{i=1}^{m}x_{i}\right ) + \left (\displaystyle\sum _{i=1}^{m}\displaystyle\sum _{j=1}^{m}x_{i}x_{j}\displaystyle\sum _{k=1}^{n}b_{ki}b_{kj}\right )$$
(155)

The inner sum \(\sum _{k}b_{ki}b_{kj}\) counts the vertices shared by edges i and j, so the second term penalizes selecting two edges that meet at a node. By comparing the coefficients of \(E_{\mathrm{GMP}}\) and E d , the synaptic weights and the input bias are obtained as

$$W_{ij} = -2\displaystyle\sum _{k=1}^{n}b_{ki}b_{kj}\quad \mathrm{and}\quad \theta _{i} = 1$$
(156)

Thus, with the above mapping, the H-T’s model is complete. The above system is solved by using Algorithm 1 to get the solution of GMP.
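In matrix form, Eq. (156) is W = −2 BᵀB off the diagonal. The sketch below (hypothetical; the edge-list helper and the example are ours) constructs it from an edge list; the diagonal self-terms are simply dropped here, and in practice the relative scale of the unit reward and the sharing penalty may need tuning.

```python
import numpy as np

def gmp_weights(edges, n):
    """Eq. (156): W = -2 * B^T B (off-diagonal), theta = 1, from an edge list."""
    m = len(edges)
    B = np.zeros((n, m))                # vertex-edge incidence matrix
    for j, (u, v) in enumerate(edges):
        B[u, j] = B[v, j] = 1.0
    W = -2.0 * (B.T @ B)                # (B^T B)_ij = #vertices shared by i, j
    np.fill_diagonal(W, 0.0)            # self-terms dropped in this sketch
    theta = np.ones(m)
    return W, theta

# Path 0-1-2-3: edges (0,1), (1,2), (2,3); maximum matching is {(0,1), (2,3)}.
W, theta = gmp_weights([(0, 1), (1, 2), (2, 3)], n=4)
print(W)
```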

6.2 Shortest Path Problems

Problem Description: Given a network G = (V, E), where V represents the vertices or cities, | V | = n and E represents the edges or paths, | E | = m. Let the distances between the cities be represented by a distance matrix D, where d ij is the distance between the \(i\mathrm{th}\) city and the \(j\mathrm{th}\) city. The problem is to find the shortest path between a given pair of cities. Let s be the origin and e be the destination.

Solution Methodology: Let

$$x_{ij} = \left \{\begin{array}{ll} 1&\text{if edge }(i,j)\text{ is in the path }\\ 0 &\text{otherwise} \end{array} \right.$$
(157)

Since the decision variables \(x_{ij}\) range over ordered pairs of distinct cities, there will be n(n − 1) neurons used to design the model. The modified objective function designed for H-T’s model will be

$$E_{\mathrm{SPP}}(t) = C_{1}\left (\displaystyle\sum _{i=1}^{n}\displaystyle\sum _{j\neq i}d_{ij}\,{e}^{-t/\tau }x_{ij}\right ) + \frac{C_{2}}{2} \displaystyle\sum _{i=1}^{n}{\left (\displaystyle\sum _{k\neq i}x_{ik} -\displaystyle\sum _{l\neq i}x_{li} -\delta _{is} +\delta _{ie}\right )}^{2}$$
(158)

The factor \({e}^{-t/\tau }\) anneals the distance term over time, while the second term enforces flow conservation: the net outflow must be +1 at the source s, −1 at the destination e, and 0 elsewhere. By comparing the coefficients of \(E_{\mathrm{SPP}}\) and E c , the synaptic weights and input bias are obtained as

$$W_{ij} = 2C_{2}\quad \mathrm{and}\quad \theta _{i} = -2C_{2}$$
(159)

Thus, with the above mapping, the H-T’s model is complete. The above system is solved by using Algorithm 2 to get the solution of SPP.
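The flow-conservation penalty of Eq. (158) is easy to verify for a candidate solution; the helper below is a hypothetical sketch (the function name and the toy instance are ours).

```python
import numpy as np

def flow_violation(x, s, e):
    """Flow-conservation penalty of Eq. (158): at every node, outflow minus
    inflow must equal +1 at the source s, -1 at the destination e, 0 elsewhere."""
    n = x.shape[0]
    rhs = np.zeros(n)
    rhs[s], rhs[e] = 1.0, -1.0
    return ((x.sum(axis=1) - x.sum(axis=0) - rhs) ** 2).sum()

# Path 0 -> 2 -> 3 as an edge-indicator matrix on 4 nodes (s = 0, e = 3).
x = np.zeros((4, 4))
x[0, 2] = x[2, 3] = 1.0
print(flow_violation(x, s=0, e=3))   # 0.0: a valid s-e path
x[2, 3] = 0.0                        # break the path
print(flow_violation(x, s=0, e=3))   # > 0: conservation violated at nodes 2, 3
```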

6.3 Number Partitioning Problems (NPP)

Problem Definition: Given a set of n numbers \(a_{1},a_{2},\ldots,a_{n} \in S\), partition the set S into two sets S 1 and S 2 such that the following cost function is minimized:

$$C(S_{1},S_{2}) = \left \vert \displaystyle\sum _{i\in S_{1}}a_{i} -\displaystyle\sum _{i\in S_{2}}a_{i}\right \vert $$
(160)

Solution Methodology: Let

$$x_{i} = \left \{\begin{array}{ll} 1&\text{if number }i \in S_{2} \\ 0&\text{if number }i \in S_{1} \end{array} \right.$$
(161)

Since there are n numbers, there will be n neurons used to design the model. The modified objective function designed for H-T’s model will be

$$E_{\mathrm{NPP}} = \displaystyle\sum _{i=1}^{n}\displaystyle\sum _{j=1}^{n}(1 + 2x_{i}x_{j} - x_{i} - x_{j})a_{i}a_{j}$$
(162)

The term \((1 + 2x_{i}x_{j} - x_{i} - x_{j})\) ensures that only pairs of numbers belonging to the same partition contribute to the sum, so \(E_{\mathrm{NPP}}\) equals the sum of the squares of the two subset sums. By comparing the coefficients of \(E_{\mathrm{NPP}}\) and E d , the synaptic weights and input bias are obtained as

$$W_{ij} = -4a_{i}a_{j}\quad \mathrm{and}\quad \theta _{i} = 2a_{i}\displaystyle\sum _{j=1}^{n}a_{j}$$
(163)

Thus, with the above mapping, the H-T’s model is complete. The above system is solved by using Algorithm 1 to get the solution of NPP.
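A minimal sketch follows, under our assumptions: a random integer instance, asynchronous step-function updates, and the bias \(\theta _{i} = 2a_{i}\sum _{j}a_{j}\) derived above with the diagonal self-terms folded in.

```python
import numpy as np

rng = np.random.default_rng(2)

# Instance: n numbers to split into two sets with (nearly) equal sums.
a = rng.integers(1, 50, 10).astype(float)

# Mapping of Eq. (163): W_ij = -4 a_i a_j, theta_i = 2 a_i * sum(a).
W = -4.0 * np.outer(a, a)
theta = 2.0 * a * a.sum() + 0.5 * np.diag(W)  # fold self-terms into the bias
np.fill_diagonal(W, 0.0)

x = rng.integers(0, 2, a.size).astype(float)
for _ in range(100):
    changed = False
    for i in rng.permutation(a.size):
        xi = 1.0 if W[i] @ x + theta[i] > 0 else 0.0
        changed |= xi != x[i]
        x[i] = xi
    if not changed:
        break

# Cost of Eq. (160): absolute difference of the two subset sums.
print("cost:", abs(a[x == 1].sum() - a[x == 0].sum()))
```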

6.4 Assignment Problems

Problem Definition: Given n entities, n cells, and the cost c ij of assigning the \(i\mathrm{th}\) entity to the \(j\mathrm{th}\) cell, the linear assignment problem (LAP) is to assign each entity to exactly one cell, such that the total assignment cost is minimized.

Solution Methodology: Since there are n entities and n cells, there will be altogether \(n^{2}\) neurons used for designing the ANN. Let x ij represent the external state of H-T’s model. Let \(\mathcal{I}_{ij}\) represent the internal state of H-T’s neuron, such that

$$x_{ij} = \left \{\begin{array}{ll} 1&\text{if entity }i\text{ is assigned to cell }j\\ 0&\text{otherwise} \end{array} \right.$$
(164)

The dynamics of this system is given as

$$\frac{d\mathcal{I}_{ij}(t)}{dt} = -C_{1}\displaystyle\sum _{k=1}^{n}x_{kj}(t) - C_{1}\displaystyle\sum _{k=1}^{n}x_{ik}(t) + 2C_{1} - C_{2}c_{ij}\exp (-t/\tau )$$
(165)
$$x_{ij}(t) = f_{ij}\left (\mathcal{I}_{ij}(t)\right )$$
(166)

From the above equations, the weights and input bias of H-T’s model are defined as

$$W_{ij,kl} = -C_{1}\left (\delta _{ik} +\delta _{jl}\right )\quad \mathrm{and}\quad \theta _{ij} = 2C_{1}$$
(167)

Thus, with the above mapping (the annealed cost term \(-C_{2}c_{ij}\exp (-t/\tau )\) acts as a time-varying external input), the H-T’s model is complete. The above system is solved by using Algorithm 2 to get the solution of LAP.
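The dynamics of Eq. (165) can be integrated directly. The sketch below is a hypothetical forward-Euler simulation with a sigmoid transfer function; the instance, the gain beta, the step size, and the penalty values are all our assumptions, and convergence to a clean permutation matrix generally requires tuning them.

```python
import numpy as np

rng = np.random.default_rng(3)

n = 5
c = rng.random((n, n))                    # assignment costs c_ij
C1, C2, tau = 1.0, 1.0, 10.0
beta, dt, steps = 5.0, 0.05, 4000

I = 0.01 * rng.standard_normal((n, n))    # internal states I_ij
for step in range(steps):
    t = step * dt
    x = 1.0 / (1.0 + np.exp(-beta * I))   # sigmoid transfer: x_ij = f(I_ij)
    # Eq. (165): column/row inhibition, excitatory bias, annealed cost term.
    dI = (-C1 * x.sum(axis=0, keepdims=True)    # sum_k x_kj
          - C1 * x.sum(axis=1, keepdims=True)   # sum_k x_ik
          + 2 * C1
          - C2 * c * np.exp(-t / tau))
    I += dt * dI

x = 1.0 / (1.0 + np.exp(-beta * I))       # final external states
assignment = x.round().astype(int)
print(assignment)
print("rows ok:", (assignment.sum(axis=1) == 1).all(),
      " cols ok:", (assignment.sum(axis=0) == 1).all())
```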

6.5 Sorting Problems

Problem Definition: Given n real numbers \(a_{1},a_{2},\ldots,a_{n}\), the problem is to order them in ascending, descending, or bitonic order (where both ends hold the larger numbers).

This problem can be formulated as an assignment problem, that is, the n numbers are to be assigned to n cells. The cost c ij of assigning the \(i\mathrm{th}\) number to the \(j\mathrm{th}\) cell is given by \(c_{ij} = a_{i}c_{j}\), where \(\{c_{j}\}\) is a sequence of real numbers selected according to the desired sorting sense (ascending, descending, or bitonic); see the sketch at the end of this subsection.

Solution Methodology: Since there are n numbers and n cells, there will be altogether \(n^{2}\) neurons used for designing the ANN. Let x ij represent the external state of H-T’s continuous model. Let \(\mathcal{I}_{ij}\) represent the internal state, such that

$$x_{ij} = \left \{\begin{array}{ll} 1&\text{if number }i\text{ is assigned to cell }j\\ 0&\text{otherwise} \end{array} \right.$$
(168)

The energy function and dynamics of this system are given as

$$\displaystyle\begin{array}{rcl} E_{\mathrm{SP}}(x(t),t)& =& C_{1}\left (\displaystyle\sum _{i=1}^{n}\displaystyle\sum _{j=1}^{n}a_{i}c_{j}\exp (-t/\tau )x_{ij}\right ) +\displaystyle\sum _{i=1}^{n}\frac{C_{2}^{i}}{2}{\left (\displaystyle\sum _{j=1}^{n}x_{ij} - 1\right )}^{2} \\ & & +\displaystyle\sum _{j=1}^{n}\frac{C_{2}^{j}}{2}{\left (\displaystyle\sum _{i=1}^{n}x_{ij} - 1\right )}^{2}\end{array}$$
(169)
$$\frac{d\mathcal{I}_{ij}(t)}{dt} = -C_{2}^{j}\displaystyle\sum _{k=1}^{n}x_{kj}(t) - C_{2}^{i}\displaystyle\sum _{k=1}^{n}x_{ik}(t) + C_{2}^{j} + C_{2}^{i} - C_{1}a_{i}c_{j}\exp (-t/\tau )$$
(170)
$$x_{ij}(t) = f_{ij}\left (\mathcal{I}_{ij}(t)\right )$$
(171)

From the above equations, the weights of H-T’s model can be obtained as

$$W_{ij,kl} = -C_{2}^{j}\,\delta _{jl} - C_{2}^{i}\,\delta _{ik}\quad \mathrm{and}\quad \theta _{ij} = C_{2}^{j} + C_{2}^{i}$$
(172)

Thus, with the above mapping, the H-T’s model is complete. The above system is solved by using Algorithm 2 to get the solution of the sorting problem.
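The choice of the cell sequence \(\{c_{j}\}\) is what encodes the sorting sense. The sketch below is a plausible construction of ours (not from the source): by the rearrangement inequality, the LAP places the largest \(a_{i}\) where \(c_{j}\) is smallest, so the shape of \(c_{j}\) dictates the shape of the output. A brute-force assignment stands in for the ANN dynamics on a tiny instance.

```python
from itertools import permutations

import numpy as np

def cell_costs(n, sense="ascending"):
    """A plausible cell sequence c_j for each sorting sense (an assumption)."""
    j = np.arange(n, dtype=float)
    if sense == "ascending":
        return -j                          # c_j decreasing -> a_i increasing
    if sense == "descending":
        return j                           # c_j increasing -> a_i decreasing
    if sense == "bitonic":
        return -np.abs(j - (n - 1) / 2.0)  # smallest c_j at both ends
    raise ValueError(sense)

a = np.array([3.0, 1.0, 4.0, 1.5, 9.0])
c = cell_costs(a.size, "bitonic")
best = min(permutations(range(a.size)),
           key=lambda p: sum(a[i] * c[p[i]] for i in range(a.size)))
order = np.empty(a.size)
order[list(best)] = a                      # place each a_i in its cell
print(order)   # high values at both ends, low values in the middle
```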

6.6 Traveling Salesman Problems (TSP)

Problem Definition: Given n cities with Euclidean distances d ij between them, a path that starts from a given city, passes through all the cities (without visiting any city twice), and returns to the first city is called a tour. The TSP can be defined as finding the shortest tour through the given cities.

Solution Methodology: In Sect. 2, a detailed mapping of TSP onto H-T’s model was shown. In this subsection, one of the improved TSP mappings [119] is presented. It is a modification of H-T’s approach, which can be described as

$$E_{\mathrm{tsp}}^{\mathrm{mod}} = E_{\mathrm{ tsp}} + E_{\mathrm{cns}} + E_{\mathrm{drv}}$$
(173)

where \(E_{\mathrm{tsp}}\) is same as defined in Eq. (33),

$$E_{\mathrm{cns}} = \frac{C_{1}}{2}\displaystyle\sum _{i}{\left (\displaystyle\sum _{j}x_{ij} - 1\right )}^{2} + \frac{C_{2}}{2}\displaystyle\sum _{j}{\left (\displaystyle\sum _{i}x_{ij} - 1\right )}^{2}$$
(174)
$$E_{\mathrm{drv}} = \frac{C_{3}}{2}\displaystyle\sum _{i}\displaystyle\sum _{j}x_{ij}(1 - x_{ij}) + \frac{C_{4}}{2}\left (\frac{{N}^{2}}{4} -\displaystyle\sum _{j}\displaystyle\sum _{i}{\left (x_{ij} -\frac{1}{2}\right )}^{2}\right )$$
(175)

and \(C_{1},C_{2},C_{3}\), and \(C_{4}\) are new penalty parameters that scale the respective penalty terms. \(E_{\mathrm{cns}}\) represents a better way of mapping the TSP constraints, whereas \(E_{\mathrm{drv}}\) is a penalty that drives the external states of the neurons toward the values 0 or 1.
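To make the roles of these penalties concrete, the hypothetical helper below evaluates \(E_{\mathrm{cns}}\) and \(E_{\mathrm{drv}}\) for a given state matrix (the tour-length term \(E_{\mathrm{tsp}}\) of Eq. (33) is not reproduced here); both penalties vanish exactly on permutation matrices, as the check shows.

```python
import numpy as np

def tsp_penalties(x, C1=1.0, C2=1.0, C3=1.0, C4=1.0):
    """Evaluate E_cns (Eq. 174) and E_drv (Eq. 175) for a state matrix x.

    x[i, j] in [0, 1] is the strength with which city i occupies tour
    position j; N is the number of cities.
    """
    N = x.shape[0]
    e_cns = (C1 / 2) * ((x.sum(axis=1) - 1) ** 2).sum() \
          + (C2 / 2) * ((x.sum(axis=0) - 1) ** 2).sum()
    e_drv = (C3 / 2) * (x * (1 - x)).sum() \
          + (C4 / 2) * (N ** 2 / 4 - ((x - 0.5) ** 2).sum())
    return e_cns, e_drv

# A valid permutation matrix zeroes both penalties; a uniform "soup" does not.
perm = np.eye(4)[[2, 0, 3, 1]]
soup = np.full((4, 4), 0.25)
print(tsp_penalties(perm))   # (0.0, 0.0)
print(tsp_penalties(soup))   # both positive
```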

There are numerous COPs in the literature, and presenting an ANN model for each of them would be a cumbersome task. However, a similar approach (like the above models) can be followed for any discrete optimization problem. Moreover, the above mappings were based on H-T’s model with the penalty approach. As seen in Sect. 3, the capability of ANNs to solve COPs can be extended by using the Lagrange approach. Thus, any discrete optimization problem (linear or nonlinear) can be modeled using the techniques given in Sect. 5. In the remainder of this section, a survey of some of the recent ANN models is presented.

A recent survey of the usage of ANNs in mathematical programming problems is presented in [128]. Apart from the theoretical COPs, many practical problems have been mapped onto ANNs. For example, some of the recent applications of ANNs within the past 2 years are

  • Resource allocation in wireless communication [17, 78]

  • Scheduling in wireless sensors [105]

  • Scheduling in packet radio networks [113]

  • Scheduling via H-T’s model [31]

  • Scheduling in water pumps [49]

  • Scheduling to minimize earliness and tardiness [8]

  • Small-world nature of H-T’s model [141]

  • Constrained least absolute deviation problem [60]

  • Hybrid approach with genetic algorithms [5]

  • Hybrid approach with particle swarm optimization [33]

  • Hybrid approach with ant colony optimization [129]

  • Parameter setting for the GCP [118]

  • Placement of electronic circuit problem [36]

  • Edge detection in wood logs [95]

  • Economic load dispatching problem [48]

  • Hybrid routing algorithm [11]

  • Data envelopment analysis [59]

  • Water system management problem [120]

  • Graphs isomorphism discernment problem [140]

  • Nonlinear vector encoding problem [41]

7 Criticism

After the eye-catching demonstration of ANNs solving the TSP, H-T’s model caught the attention of many researchers [92]. However, the method had two drawbacks at that time: it has many parameters to select, and it performs a gradient descent while solving COPs. In [132], it was illustrated that H-T’s model does not perform well. In fact, the results in [132] raised serious questions about the validity of this method and drifted many researchers away from the use of ANNs in solving COPs. However, some researchers were inclined to improve the method proposed by Hopfield and Tank. One direction of improvement was to develop a method of optimally selecting the parameters in Hopfield and Tank’s energy function [37, 52, 65, 74, 117]. Recently, in [119], a systematic way of selecting optimal parameters for various TSP instances was presented. Apart from that, an enhanced method of mapping TSP onto an ANN was illustrated, and a convergence and stability analysis was presented. From the results shown in [119], it can be concluded that proper construction of the energy function and proper selection of the (penalty) parameters can make a significant difference in the performance of ANNs on COPs.

Another direction of improvement in Hopfield and Tank’s ANN was to free the algorithm from the gradient descent method, that is, to avoid falling into local minima. One method, adopted by many researchers, was to modify Hopfield and Tank’s objective function [4, 16]. However, some of the more successful methods used hybrid techniques such as convexification, improved choices of the penalty function, and hill-climbing techniques [123]. In [109], a novel approach that combines self-organizing neural networks and H-T’s model was presented, and the convergence of such networks was analyzed. The algorithm was tested on the car-sequencing problem, and it was concluded that the hybrid ANN method performs better than the MINOS Footnote 14 solver. The results from this study showed that a proper application of ANNs to solve COPs can outperform the conventional solution methods for COPs.

Moreover, many stochastic and metaheuristic methods were incorporated into H-T’s model. Some studies showed that global convergence can be obtained with ANNs if they are properly coupled with metaheuristic methods such as simulated annealing [91]. Apart from that, sophisticated penalty methods and Lagrange methods made the use of ANNs more general (i.e., applicable not only to linear and quadratic but to any nonlinear optimization problems). Among the penalty methods, subspace and hyperplane approaches were used in [6, 42]. These were the pioneering approaches that modified the use of the penalty function by incorporating the feasibility issues of the system into a single penalty parameter. In [43], the capability of ANNs was demonstrated based on polyhedral combinatorics; it was one of the novel ANN methods that proposed the use of memory (as in tabu search) to overcome the entrapment of the algorithm at polyhedral vertices. However, not many studies have actually compared the performance of ANNs with other conventional or heuristic methods for solving COPs. Thus, it is very hard to conclude whether ANNs perform better or worse than other methods. Apart from that, most of the implementations were simulations of ANNs on digital computers, whose computing limitations may hinder the performance of ANNs. Nevertheless, it should be noted that ANNs can overcome the limitations of digital computers, since they can be implemented on analog circuits.

8 Conclusion

In this chapter, a simple and brief introduction to ANNs was presented, followed by a discussion of the usage of ANNs in solving COPs. Our focus throughout this chapter was not only to highlight the applications of ANNs in solving COPs but also to present the theoretical aspects of ANNs in solving COPs. At the beginning of this chapter, some of the classical ANN models that were used in solving COPs were illustrated. Along with these illustrations, a methodology for implementing an LP on an analog circuit was depicted. Moreover, a detailed method of mapping TSP onto one of the classical ANNs was presented. In addition, some of the stability and convergence analyses of ANNs were studied. Lastly, examples from the literature on solving general optimization problems and discrete optimization problems using ANNs were presented.

It can be seen that, unlike other metaheuristics, under appropriate conditions ANNs provably converge to local and/or global optima. Simple methods that are used to derive such optimality conditions were shown with proofs in Sect. 3 of this chapter. The aim in presenting these proofs was to highlight the importance of selecting a proper energy function and type of dynamics (i.e., consecutive, parallel, or synchronous). Apart from that, methods to extend the classical models (which can solve only linear or quadratic problems) by using penalty functions, Lagrange functions, or hybrid approaches were presented. Through various illustrations, it was shown that a continuous model can be used to solve a discrete problem very easily. It was underlined that the only differences between solving a discrete problem and a continuous one are the adjustment of the transfer function and the adjustment of the energy function for the discrete variables.

From the first implementations of ANNs in the 1980s until the present, there have been only a handful of analog implementations of ANNs for solving COPs. Due to the limited performance comparisons of ANNs with other solution procedures, a concrete conclusion regarding the advantages or disadvantages of ANNs in solving COPs cannot be drawn at this point. However, proper implementation of the analog circuits and systematic study of the deployment of these circuits to solve real-world problems will demonstrate the true potential of ANNs.

9 Cross-References

Binary Unconstrained Quadratic Optimization Problem

Graph Searching and Related Problems

Quadratic Assignment Problems