1 Motivation and Related Work

A heuristic for an optimization problem is an algorithm that explores not all possible states of the problem, but rather the most likely ones. Purely heuristics-based solutions may be inconsistent; therefore, metaheuristics are used, which usually perform better than simple heuristics [14]. A metaheuristic is a generic algorithmic template that can find high-quality solutions of optimization problems [4] by exploiting a trade-off between local search and global exploration. Metaheuristics find good-quality solutions for optimization problems in a reasonable amount of time, but there is no guarantee that the optimal solution is always reached [26].

In this paper, we consider a challenging area of optimization – the optimal design of multiproduct batch plants, e.g., in the chemical industry for producing pharmaceuticals, polymers, food, etc. There has been active research on efficiently solving this and similar problems. The classical n-queens problem was addressed using Ant Colony Optimization (ACO) [17, 24] and its combination with a Genetic Algorithm (GA) [2]. Paper [21] solves process engineering problems with the Differential Evolution (DE) algorithm and demonstrates its advantages over exact optimization by Branch-and-Bound (B&B) and over a GA. In [13], a particle swarm algorithm and a GA are exploited for multiproduct batch plant design. Paper [10] develops a multiobjective GA which demonstrates high flexibility and adaptability for various engineering problems. The problem of the optimal design of batch plants with imprecise demands on product amounts is addressed in [3] by integrating an analytic hierarchy process strategy for the analysis of the GA Pareto-optimal solutions. Paper [19] uses ACO and Simulated Annealing (SA) to solve a stochastic facility layout problem in which product demands are normally distributed random variables.

In order to reduce the run time of metaheuristics-based approaches, their implementation on different parallel architectures has been studied. In particular, Graphics Processing Units (GPUs) are widely used, typically via the CUDA platform [20]. GPUs were used for solving the classical Traveling Salesman Problem (TSP) by simulated annealing [27] and by an ant system [8]. The problem of scheduling transit stop inspection and maintenance was studied using Harmony Search and ACO [16], with alternative implementations on CPU and GPU.

Our contribution in this paper is two-fold: (1) we develop a novel hybrid approach which combines two metaheuristics – Ant Colony Optimization (ACO) [11] and Simulated Annealing (SA) [18], and (2) we implement it on a CPU-GPU system using CUDA and show that it is preferable to the Branch-and-Bound approach used in our previous work [7]. Section 2 describes the mathematical problem formulation, Sect. 3 – the methodology of our hybrid ACO+SA approach, and Sect. 4 – its parallelization. Section 5 reports our experimental results, and Sect. 6 concludes the paper.

2 Problem Formulation

Our application use case is optimizing a Chemical-Engineering System (CES) – a set of equipment (tanks, filters, dryers, etc.) which manufactures some products. A CES consists of a sequence of I processing stages; the i-th stage is equipped with equipment units from a finite set \(X_i\), with \(J_i\) being the number of equipment unit variants in \(X_i\). All equipment unit variants of a CES are described as \(X_i=\{x_{i,j}\}, i=\overline{1,I}, j=\overline{1,J_i}\), where \(x_{i,j}\) is the main size (working volume, working surface) of unit variant j suitable for processing stage i. A CES variant \(\varOmega _e, e=\overline{1,E}\) (where \(E=\prod _{i=1}^I J_i\) is the number of all possible variants) is an ordered set of available equipment unit variants. The goal is to find the optimal number of units at the processing stages and their sizes; the input data are the demand for each product of the assortment, the production horizon, the available equipment set, etc. Each variant \(\varOmega _e\) of a system must be in an operable condition (compatibility constraint), i.e., it must satisfy the condition of a joint action of its processing stages, expressed by a function S: \(S(\varOmega _e)=0\) if the compatibility constraint is satisfied. An operable variant of a CES must also satisfy a processing time constraint: \(T(\varOmega _e) \le T_{max}\), where \(T_{max}\) is the total available time (horizon).

Thus, designing an optimal CES is formulated as follows [5, 6]: find a variant \(\varOmega ^*\in \{\varOmega _e\}, e=\overline{1,E}\) of a CES that minimizes the objective function – the equipment costs \(Cost(\varOmega _e)\) – while both the compatibility and the processing time constraints are satisfied:

$$\begin{aligned} \varOmega ^*=\textit{argmin}\ \textit{Cost}(\varOmega _e), e=\overline{1,E} \end{aligned}$$
(1)
$$\begin{aligned} \varOmega _e = \{x_{1,j_1},x_{2,j_2},\ldots , x_{I,j_I}|j_i=\overline{1,J_i},i=\overline{1,I}\}, e=\overline{1,E} \end{aligned}$$
(2)
$$\begin{aligned} x_{i,j}\in {X_i},\,i=\overline{1,I}, j=\overline{1,J_i} \end{aligned}$$
(3)
$$\begin{aligned} S(\varOmega _e)=0, e=\overline{1,E} \end{aligned}$$
(4)
$$\begin{aligned} T(\varOmega _e) \le T_{max}, e=\overline{1,E} \end{aligned}$$
(5)

The search space can be represented as a tree of height I (Fig. 1). Each tree level corresponds to one processing stage of the CES, and each edge corresponds to a selected equipment variant taken from the set of possible variants \(X_i\) at stage i. Each node \(n_{i,k}\) at tree level \(N_i=\{ n_{i,1},n_{i,2},\ldots , n_{i,K_i}\},\,i=\overline{1,I}, k=\overline{1,K_i}, K_i=\prod _{l=1}^i J_l\) corresponds to a variant of equipment units for stages 1 to i.

Fig. 1. The search tree for a CES with 4 stages.

Figure 1 shows an example CES consisting of 4 stages (\(I = 4\)), where each stage can be equipped with 2 devices (\(J_1 = J_2 = J_3 = J_4 = 2\)), i.e., the number of all possible system variants is \(2^4=16\).
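
To make the formulation (1)–(5) concrete, the following sketch (in CUDA C++, the implementation language used later in this paper) shows one possible encoding of a CES variant and of the problem interface. All names (Variant, Problem, compatibleUpTo, processingTime) are our own illustrative choices, and the three callbacks merely stand in for the application-specific models behind Cost, S and T.

#include <vector>
#include <functional>

// A CES variant: v[i] is the index j of the chosen unit x_{i,j} at stage i.
using Variant = std::vector<int>;

// Hypothetical problem description; the three callbacks stand in for the
// application-specific models behind Cost (1), S (4) and T (5).
struct Problem {
    std::vector<int> J;                       // J[i]: number of unit variants at stage i
    double Tmax;                              // production horizon
    std::function<double(const Variant&)> cost;               // Cost(Omega_e)
    std::function<bool(const Variant&, int)> compatibleUpTo;  // S = 0 for stages 1..i
    std::function<double(const Variant&)> processingTime;     // T(Omega_e)

    bool feasible(const Variant& v) const {   // constraints (4) and (5)
        return compatibleUpTo(v, (int)v.size()) && processingTime(v) <= Tmax;
    }
    long long numVariants() const {           // E = prod_i J_i
        long long e = 1;
        for (int ji : J) e *= ji;
        return e;
    }
};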

3 Hybrid Metaheuristic Approach

Our approach to the optimal design of multi-product batch plants is based on two metaheuristics: Simulated Annealing (SA) and Ant Colony Optimization (ACO). SA is widely used for solving optimization problems [18, 26]; its key advantage is escaping from local optima by allowing hill-climbing moves to find a global optimum. SA can deal with arbitrary systems and objective functions; it often finds an optimal solution and generally finds a good-quality solution.

Before searching for a solution using SA, we need a feasible initial solution. For classical optimization problems, e.g., the TSP, it is possible to use a random initial solution. However, for our problem described in Sect. 2, random initialization is unacceptable, because the compatibility (4) and processing time (5) constraints must be satisfied. Our search for a feasible initial solution is a Constraint Satisfaction Problem (CSP) [23] without cost optimization, which consists in finding an operable variant of a CES satisfying (4) and (5). For solving it, we use the ACO metaheuristic [15], which provides good-quality results in many applications, including CSP [24].

3.1 Ant Colony Optimization (ACO)

The Ant Colony Optimization (ACO) metaheuristic can be viewed as a multi-agent system in which agents (ants) interact with each other in order to reach a global goal [26]. It is inspired by the behaviour of ant colonies: while walking from a food source to the nest and vice versa, ants deposit a chemical substance called pheromone on their path. Pheromone is used as a communication medium among ants and guides them to find the shortest path from the nest to the food: ants follow, with some probability, the pheromone deposited by previous ants.

Listing 1. Pseudocode of the ACO algorithm for our problem.

Listing 1 shows the pseudocode of the ACO algorithm for our problem. The number of ants M is an algorithm parameter which determines the trade-off between the number of iterations and the breadth of the search per iteration: the larger the number of ants per iteration, the fewer iterations are needed [25]. All ants behave in a similar way: each ant moves from the top of the tree in Fig. 1 to the bottom. Once an ant has selected a node \(r=n_{i,j}\) at level i, it can pick the next child node \(s=n_{i+1,j}\). The tour of an ant ends at the last tree level I; each path corresponds to a potential solution of the problem. The ant's transition from node r to node s is probabilistically biased by two values: the pheromone trail \(\tau _{rs}\) and the heuristic information \(\eta _{rs}\), as follows: \(p_{rs} = {\tau _{rs}^\alpha \cdot \eta _{rs}^\beta } / {\sum _{k\in C_{r}} {(\tau _{rk}^\alpha \cdot \eta _{rk}^\beta )}}\), where \(C_{r}\) is the set of child nodes of r [12, 24, 26]. The exponents \(\alpha \) and \(\beta \) control the relative importance of the pheromone trails and the heuristic information, respectively.
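
For illustration, the probabilistic transition rule can be realized as a roulette-wheel selection over the child nodes of r. The following sketch is our own (it is not taken from Listing 1): it assumes that tau and eta hold the pheromone and heuristic values of the edges leading to the children of the current node, and it uses a standard C++ random generator.

#include <vector>
#include <cmath>
#include <random>

// Roulette-wheel choice of the next child node, biased by
// p_rs ~ tau_rs^alpha * eta_rs^beta; returns the index of the selected child.
int selectChild(const std::vector<double>& tau,   // pheromone on edges to the children
                const std::vector<double>& eta,   // heuristic information of the children
                double alpha, double beta, std::mt19937& rng) {
    std::vector<double> weight(tau.size());
    double sum = 0.0;
    for (size_t k = 0; k < tau.size(); ++k) {
        weight[k] = std::pow(tau[k], alpha) * std::pow(eta[k], beta);
        sum += weight[k];
    }
    std::uniform_real_distribution<double> u(0.0, sum);
    double r = u(rng), acc = 0.0;
    for (size_t k = 0; k < weight.size(); ++k) {
        acc += weight[k];
        if (r <= acc) return (int)k;
    }
    return (int)weight.size() - 1;   // numerical fallback
}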

Our approach to calculating the heuristic information is based on the fact that a CES with bigger units is usually more expensive, but it has a bigger batch size and thus produces faster than a CES with smaller units, which is favourable for satisfying the time constraint (5). Therefore, we make a unit that satisfies the compatibility constraint (4) for the beginning part of the CES and has a larger basic size more preferable than a unit with an unsatisfied compatibility constraint and a smaller basic size. We use the following rule for the pheromone update (line 8): \(\tau _{rs} = \tau _{rs} + Q / \sum _{m=1}^M{L_m}\), where Q is a constant, \(L_m\) is the tour length of the m-th ant, and M is the swarm size. The smaller the value of \(L_m\), the larger the value added to the previous pheromone value. We use \(L_m\) as a fitness value that indicates how close a given solution is to achieving the required goals. Listing 2 shows our approach to computing the fitness value L[m] (line 2). Function NumberS() (lines 4–8) counts the number of stages of the beginning part of the CES, composed of devices for stages 1 to i (lines 6–7), for which (4) is satisfied. We add 1 to NumberS() if constraint (5) is satisfied (line 2). Therefore, the minimal fitness value is 0 (no constraint is satisfied) and the maximal value is \(I+1\) (all constraints are satisfied, i.e., the problem is solved). We use the maximal fitness value as the constant Q: \(Q=I+1\). Over time, the concentration of pheromone decreases due to evaporation. The evaporation (Listing 1, line 9) is performed at a constant rate after the completion of each iteration; it prevents an unlimited increase of the pheromone values and allows the ant colony to forget poor choices made previously [25]. We implement it as \(\tau _{rs} = \tau _{rs} \cdot \rho \), where \(\rho \in [0,1]\) is the trail persistence parameter.

Listing 2. Computation of the fitness value L[m].
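
Since Listing 2 is available only as a figure, the following sketch gives our reading of the fitness computation and of the pheromone deposit and evaporation rules described above. The tests compatibleUpTo and timeOk are assumed application-specific predicates for constraints (4) and (5), and all identifiers are illustrative.

#include <vector>
#include <functional>

// Fitness of one ant's solution v: the number of leading stages whose partial
// variant satisfies the compatibility constraint (4), plus 1 if the processing
// time constraint (5) holds; the maximum I+1 means a feasible variant was found.
int fitness(const std::vector<int>& v,
            const std::function<bool(const std::vector<int>&, int)>& compatibleUpTo,
            const std::function<bool(const std::vector<int>&)>& timeOk) {
    int numberS = 0;
    for (int i = 1; i <= (int)v.size(); ++i) {   // beginning part: stages 1..i
        if (!compatibleUpTo(v, i)) break;
        numberS = i;
    }
    return numberS + (timeOk(v) ? 1 : 0);
}

// Pheromone deposit on the edges of the ants' tours: tau_rs += Q / sum_m L_m.
void deposit(std::vector<double>& tau, const std::vector<int>& tourEdges,
             double Q, double sumL) {
    for (int e : tourEdges) tau[e] += Q / sumL;
}

// Evaporation at a constant rate on every edge: tau_rs *= rho, rho in [0, 1].
void evaporate(std::vector<double>& tau, double rho) {
    for (double& t : tau) t *= rho;
}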

3.2 Simulated Annealing (SA)

The basic idea of SA is to use a random search which accepts not only changes that improve the objective function, but also, with some probability, changes that worsen it, in order to escape local minima. A parameter t called temperature governs the search behaviour.

Listing 3. Pseudocode of our SA algorithm.

Listing 3 shows the pseudocode of our SA version, which performs two loops: the inner loop (line 4) searches for a neighbouring solution, and the outer loop (line 3) decreases the temperature in order to reduce the probability of accepting non-improving neighbouring solutions in the inner loop. W is a vector of length I; each element W[i] specifies the device variant at stage i of the problem solution (1)–(5). At each iteration of the inner loop, we generate a new candidate solution Wcand in the neighbourhood of the current feasible solution (line 5) using our procedure Perturb() (lines 17–20): we select a random stage (line 18) of the feasible solution, for which we select a random unit (line 19) from the equipment set accessible for this stage. Thus, at each iteration we change only one unit of the feasible solution, at a single stage. We avoid getting trapped in a local optimum by randomly generating neighbours and accepting a solution that worsens the value of the objective function with a certain probability [22], which depends on the change of the objective function \(\varDelta \mathcal {E}\) and the parameter t: the acceptance probability p decreases over time as t decreases. Consequently, SA first performs a wide investigation of the solution space and then restricts it gradually, converging to the best solution. We initialize W with the initial feasible solution Winit obtained as the result of ACO, and the temperature t with the initial value Tinit (line 2). We choose Tinit as the difference between the costs of the most expensive and the cheapest CES variant, as recommended in [1]. The transition probability p (line 13) is determined by \(p=\exp (-\varDelta \mathcal {E} /(k_{\mathrm {B}}\cdot t ))\), where \(k_{\mathrm {B}}\) is Boltzmann's constant and \(\varDelta \mathcal {E}\) is the change of the energy level [28]. We use \(k_{\mathrm {B}} = 1\) and \(\gamma = 1\) [28]. Thus, the probability becomes p = exp(-deltaCost / t) (line 13).

A finite-time implementation of SA is obtained by generating a sequence of homogeneous Markov chains of finite length \(L_{max}\), which depends on the size of the problem [1]. The iterations at a given value of t are repeated Lmax times (line 4). We compute Lmax as the total size of the equipment set, i.e., \(L_{max}=\sum _{i=1}^I J_i\), where \(J_i\) is the number of equipment unit variants for stage i. When perturbing the feasible solution, we select at each iteration a random unit (line 19) in one random stage (line 18). Temperature t is decreased at the end of each outer iteration using a cooling schedule defined by an initial temperature Tinit, a rule for reducing t, and a final temperature Tfinal, which is fixed at a small value chosen as the smallest possible difference in cost between two neighbouring solutions; in our case, we use for Tfinal the price of the cheapest unit. We use (line 16) the fast cooling rule \(t = \sigma \cdot t\) [26], where \(0.8 \le \sigma \le 0.99\), as recommended in [1].
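
Putting the elements of this section together, a minimal sequential sketch of the SA loop could look as follows. The function costOf, the random source and all identifiers are our own placeholders, and the feasibility checks of (4) and (5) on the candidate solution are omitted for brevity.

#include <vector>
#include <cmath>
#include <random>
#include <functional>

using Variant = std::vector<int>;

// Simulated annealing as described in Sect. 3.2 (a sketch, not the actual Listing 3);
// the initial solution Winit is the feasible CES variant found by ACO.
Variant simulatedAnnealing(const Variant& Winit, const std::vector<int>& J,
                           const std::function<double(const Variant&)>& costOf,
                           double Tinit, double Tfinal, double sigma,
                           std::mt19937& rng) {
    Variant W = Winit, Wbest = Winit;
    int Lmax = 0;
    for (int Ji : J) Lmax += Ji;                          // L_max = sum_i J_i
    std::uniform_real_distribution<double> u01(0.0, 1.0);

    for (double t = Tinit; t > Tfinal; t *= sigma) {      // outer loop: cooling t = sigma * t
        for (int l = 0; l < Lmax; ++l) {                  // inner loop: Markov chain of length Lmax
            Variant Wcand = W;                            // Perturb(): change one unit at one stage
            int i = std::uniform_int_distribution<int>(0, (int)J.size() - 1)(rng);
            Wcand[i] = std::uniform_int_distribution<int>(0, J[i] - 1)(rng);

            double deltaCost = costOf(Wcand) - costOf(W);
            if (deltaCost <= 0 || u01(rng) < std::exp(-deltaCost / t))
                W = Wcand;                                // Metropolis acceptance rule
            if (costOf(W) < costOf(Wbest)) Wbest = W;     // remember the best solution seen
        }
    }
    return Wbest;
}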

4 Parallelization for GPU

Figure 2 illustrates our parallel implementation of the hybrid (ACO+SA) approach described in Sect. 3 on a system comprising a CPU and a GPU.

Fig. 2. The hybrid algorithm structure.

The application code consists of sequential code (host code on the CPU) that invokes the parallel execution of hundreds or thousands of threads on the device (GPU), where all threads execute the same kernel code. The implementation consists of the following five steps (from left to right in Fig. 2):

1. The CPU reads the input data (the number of CES stages I, the numbers of accessible equipment variants \(J_i\), the production horizon \(T_{max}\), etc.) from a file, initializes the metaheuristics' parameters for ACO and SA, sends this data to the GPU, and starts the kernel function for ACO on the GPU.

2. The ACO kernel on the GPU searches for the first feasible solution – the initial CES variant – as described in Sect. 3.1. We use the Multiple Ant Colonies approach [9]: all colonies work as threads in parallel and solve the problem independently. If some thread finds a solution, then all threads terminate. With an increasing number of threads, the probability of finding a solution increases, and therefore the search time is typically reduced.

3. The CPU receives the obtained solution, distributes it among the threads as the initial solution for SA, and starts the SA kernel function on the GPU.

4. The SA kernel on the GPU searches in each thread for the optimal solution, starting from the initial solution found by ACO; i.e., we do not try to reduce the time of one iteration, but rather increase the number of iterations executed simultaneously. Each thread executes an independent instance of SA; thus, the chance of the algorithm converging to the global optimum increases, even if all instances use the same initial solution. A larger number of threads does not reduce the run time of the algorithm, but rather increases the probability that some thread eventually finds a nearly optimal solution.

5. The CPU receives the SA solutions obtained by the GPU threads and chooses the best among them – this is the final solution of our problem.

Host Code. The host starts its work by loading the input data from a file. The number of threads is a program launch parameter passed as a command-line argument. The host sends the data to the GPU and starts the kernel ACO() that implements the ACO algorithm. A CUDA kernel launch is asynchronous, i.e., it returns control to the CPU immediately after starting the kernel. Using cudaDeviceSynchronize(), the CPU waits until the GPU terminates and then receives the results from it.
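
A stripped-down version of this host flow might look as follows; the buffer names, the kernel signature and the launch configuration are illustrative assumptions rather than the actual program, and error checking is omitted.

#include <cuda_runtime.h>

// Forward declaration of the ACO kernel (illustrative signature).
__global__ void ACO(const int* input, int* solution, int maxIterNumber,
                    unsigned long long seed);

// Host-side sketch: copy the input to the GPU, launch ACO, wait, copy the result back.
void runACO(const int* h_input, size_t inputBytes, int* h_solution, size_t solBytes,
            int numThreads, int maxIterNumber, unsigned long long seed) {
    int *d_input = nullptr, *d_solution = nullptr;
    cudaMalloc((void**)&d_input, inputBytes);
    cudaMalloc((void**)&d_solution, solBytes);
    cudaMemcpy(d_input, h_input, inputBytes, cudaMemcpyHostToDevice);

    int threadsPerBlock = 256;
    int blocks = (numThreads + threadsPerBlock - 1) / threadsPerBlock;
    ACO<<<blocks, threadsPerBlock>>>(d_input, d_solution, maxIterNumber, seed);  // asynchronous

    cudaDeviceSynchronize();                     // wait until the kernel has finished
    cudaMemcpy(h_solution, d_solution, solBytes, cudaMemcpyDeviceToHost);
    cudaFree(d_input);
    cudaFree(d_solution);
}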

Kernel Code for ACO. Listing 4 shows our parallel implementation of ACO, where each thread simulates the work of one ant colony. Initially, in all threads, all edges are assigned small random pheromone values from the interval [0, 1] (lines 4–5). The global flag isFound and the local iteration counter iterCounter are used to control the threads. The flag is changed by a thread using atomicAdd() when this thread has found a feasible solution (line 21). The local iteration counter serves each thread as a protection against non-termination: if the ants in a thread cannot find a solution after maxIterNumber iterations (which is possible for stochastic algorithms), then the thread terminates. After initialization, each ant m of the swarm of M ants generates a path (lines 9–17). Here, Want is a local two-dimensional array with M rows, each row being a vector of length I specifying the device variant at each stage of the solution.
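
The control structure of such a kernel can be sketched as follows. This is a skeleton only: the problem-specific pheromone initialization and path construction are elided as comments, and the use of cuRAND for random numbers is our own choice, not necessarily that of the original implementation.

#include <curand_kernel.h>

__device__ int isFound = 0;   // global flag: set once some thread finds a feasible variant

// Skeleton of the ACO kernel: each thread simulates one independent ant colony.
__global__ void ACO(const int* input, int* solution, int maxIterNumber,
                    unsigned long long seed) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    curandState rng;
    curand_init(seed, tid, 0, &rng);

    // ... initialize this colony's pheromone values with curand_uniform(&rng) ...

    for (int iterCounter = 0;
         iterCounter < maxIterNumber && atomicAdd(&isFound, 0) == 0;  // atomic read of the flag
         ++iterCounter) {
        bool found = false;
        // ... let the M ants of this colony build paths down the search tree and
        //     set found if a variant satisfying constraints (4) and (5) appears ...
        if (found) {
            atomicAdd(&isFound, 1);        // signal all other threads to terminate
            // ... write the feasible variant to solution ...
        }
    }
    (void)input; (void)solution;           // unused in this skeleton
}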

We do not discuss the kernel code of SA – it largely follows Listing 3.

5 Experimental Results

Our experiments are conducted on a heterogeneous system comprising: (1) a CPU: Intel Xeon E5-1620 v2, 4 cores with Hyper-Threading, 3.7 GHz with 16 GB RAM, and (2) a GPU: NVIDIA Tesla K20c with altogether 2496 CUDA cores and 5 GB of global memory. We use Ubuntu 16.04.2, NVIDIA Driver version 367.57, CUDA version 8.0 and GNU C++ Compiler version 5.4.0.

Listing 4. Kernel code of the parallel ACO implementation.

As our test case, we evaluate the design of a CES consisting of 16 processing stages with 2 to 12 variants of devices at every stage (i.e., \(2^{16}\) to \(12^{16}\) CES variants in total). In our previous work [6, 7], we used the Branch-and-Bound (B&B) algorithm to find the globally optimal solution. Here we solve the same problem on the same test system using our hybrid metaheuristic approach (ACO+SA), and we compare the results with the solution obtained by B&B. Since both SA and ACO are probability-based algorithms, their results differ between runs on the same problem instance; therefore, we run each instance 100 times and take the average of the measured values.

Fig. 3. Run time of (ACO+SA) depending on the number of threads.

Fig. 4. Deviation of the found solution from the global optimum.

Figure 3 shows how the run time of the (ACO+SA) parallel program depends on the number of threads. We run our CUDA-based implementation with the number of threads ranging from 100 to 2500 in steps of 100, for the CES example of 16 processing stages with 10 variants of units. We observe that the run time decreases as the number of threads increases. While on 100 threads ACO takes \(91\%\) of the total run time, the portion of ACO decreases to only about \(10\%\) on more than 2000 threads. This is because, with more threads, ACO finds a solution faster with a higher probability, whereas using more threads for SA improves the quality of the solution, but not the speed.

Figure 4 shows the deviation of the objective value (CES cost) obtained by our hybrid algorithm, calculated as \({|observed - expected|}/{expected} \cdot 100\%\). As the expected value, we use the global optimum obtained by the B&B algorithm. On 100 threads, the deviation is about \(0.38\%\), and it decreases to almost \(0\%\) for more than 2000 threads, so we achieve an almost optimal solution.

Fig. 5. Deviation (logarithmic scale) for different problem sizes.

Fig. 6. Run time (logarithmic scale) depending on problem size.

Figure 5 shows the deviation (the vertical axis has a logarithmic scale) of the solutions obtained by our hybrid (ACO+SA) algorithm (sequential on CPU, parallel on GPU using up to 2500 threads) from the global optimum obtained by sequential B&B. We observe that our parallel hybrid algorithm produces a nearly globally optimal solution: for problem sizes \(2^{16}\) to \(9^{16}\) the deviation is \(0\%\), and for problem sizes \(10^{16}\) to \(12^{16}\) it is less than \(0.01\%\). The deviation obtained by our sequential algorithm is less than \(1\%\) for the small problem size \(2^{16}\), but it increases to about \(14\%\) for the problem size \(12^{16}\). This is because the sequential implementation performs only a single run of SA: for small problem sizes, the probability of finding a good solution is higher than for larger problem sizes.

Figure 6 shows the program run time (the vertical axis has a logarithmic scale) of our hybrid approach vs. B&B, depending on the problem size. The run time of B&B increases exponentially: while for problem size \(2^{16}\) the run time is less than 1 s, for size \(12^{16}\) the run time of B&B becomes prohibitively long at about 134 h. The run time of the sequential implementation of our hybrid algorithm is less than 0.1 s for the smallest problem, and it increases to about 580 s for our maximal problem size. The run time of the parallel implementation smoothly increases from 5 s to 227 s. The parallel implementation is slower than the sequential implementation for smaller problem sizes (\(2^{16}\) to \(9^{16}\)), but for larger problem sizes (\(10^{16}\) to \(12^{16}\)) it is faster than the sequential version by about 2.6–2.8 times. At first glance, the speedup of 2.7 times over the sequential case seems small, but the quality of the solutions obtained by the parallel implementation is significantly higher: the deviation from the global optimum is less than \(0.01\%\) for the parallel implementation against more than \(10\%\) for the sequential version. This good quality of solutions is achieved by the independent runs of SA: for a larger number of threads, the probability that one of the threads finds a nearly optimal solution is higher, because running the parallel algorithm on 2500 threads is equivalent to launching the sequential algorithm 2500 times and choosing the best among the found solutions. The parameters of the metaheuristic algorithms influence both the run time and the quality of solutions. Empirically, we have found that \(\alpha =0.4\) and \(\beta =0.6\) with the colony size \(M=100\) are good values for our application; they were selected after numerous experiments. In our experiments, we use \(\rho =0.9\) and the cooling rule constant \(\sigma = 0.9\), also chosen empirically for our problem.

6 Conclusion

Our contribution is a novel hybrid (ACO+SA) metaheuristic approach to the optimization problem of multiproduct batch plant design, together with its parallel implementation on a CPU-GPU platform. We have found that increasing the number of threads accelerates finding a feasible solution with ACO and increases the reliability and quality of the solutions obtained by SA. We compare our results with the globally optimal solution obtained by the B&B method. Our experiments confirm that our parallel hybrid approach obtains good-quality solutions that are very close to the global optima obtained by a deterministic algorithm like B&B, while finding the solution much faster.