1 Introduction

We systematically compare two main classes of optimization approaches to the design of multiproduct batch plants, which are widely used, e.g., in the chemical industry for producing pharmaceuticals, polymers, food, etc.: deterministic optimization, such as branch-and-bound (B&B), and metaheuristics. We aim at improving the state of the art: while previous work describes particular solutions and shows that metaheuristics often find solutions of good quality in a reasonable amount of time [18], a systematic comparison of different approaches is lacking. There has been much research on solving similar optimization problems and on their parallelization; for lack of space, we refer to our detailed survey of related work in [3]. In particular, graphics processing units (GPUs) have been widely used by employing the compute unified device architecture (CUDA) platform [14]: GPUs were used for solving the classical travelling salesman problem (TSP) by simulated annealing [19] and by an ant system [6], and the problem of scheduling transit stop inspection and maintenance was studied using harmony search and ant colony optimization [11], with implementations on both CPU and GPU.

This paper makes two new contributions. First, we combine two metaheuristics—ant colony optimization (ACO) [8] and simulated annealing (SA) [13]—into a novel hybrid approach. Second, we implement our approach on a GPU-based system using CUDA, and we demonstrate that it is preferable to the GPU-parallelized branch-and-bound (B&B) approach [5] when both run time and solution quality are taken into account.

In Sect. 2, we formulate the problem; in Sect. 3, we introduce our hybrid ACO+SA approach; and in Sect. 4, we describe its parallelization on GPU using CUDA. Section 5 presents the experimental comparison of our hybrid approach with B&B, and Sect. 6 concludes the paper.

2 Problem: optimization of CES

The problem studied in this paper consists in optimizing the design of a chemical-engineering system (CES), i.e., a set of equipment (reactors, tanks, filters, etc.) for manufacturing diverse products. Assuming that the number of units at every stage of the CES is fixed, the problem can be formulated as follows (for a detailed mathematical formulation, see [3]). A CES works in I processing stages; the ith stage can be equipped with units from a finite set \(X_i\), where \(J_i\) is the number of equipment unit variants in \(X_i\). The equipment variants of a CES are described as \(X_i=\{x_{i,j}\},\ i=\overline{1,I},\ j=\overline{1,J_i}\), where \(x_{i,j}\) is the main size (working volume, working surface) of unit variant j suitable for processing stage i. A CES variant \(\Omega _e,\ e=\overline{1,E}\) (where \(E=\prod _{i=1}^I J_i\) is the number of all possible system variants) is an ordered set of available equipment unit variants.

Fig. 1 Example: a simple chemical-engineering system (CES)

Figure 1 shows an example of a simple CES: it consists of 4 stages (\(I = 4\)), and each stage can be equipped with one of 2 device variants (\(J_1 = J_2 = J_3 = J_4 = 2\)); therefore, the number of possible system variants in this case is \(2^4=16\).

The goal is to find the optimal number of units at the stages and their main sizes; the input data are the production horizon, the demand for each product, the available equipment, etc. Each system variant \(\Omega _e\) has to be in an operable condition (compatibility constraint), expressed by a function S: \(S(\Omega _e)=0\). If \(T_{max}\) is the total available time (horizon), then an operable variant of a CES must also satisfy the processing time constraint \(T(\Omega _e) \le T_{max}\).

We represent the search space of variants as a tree of height I, as explained in detail in [3]. Tree levels correspond to processing stages of the CES, and edges correspond to selected equipment variants: for example, the edges from level 0 correspond to elements of \(X_1\). A node \(n_{i,k}\) at tree level \(N_i=\{ n_{i,1},n_{i,2},\ldots , n_{i,K_i}\},\,i=\overline{1,I}, k=\overline{1,K_i}, K_i=\prod _{l=1}^i J_l\) corresponds to a variant of the beginning part of the CES, composed of equipment units for stages 1 to i. Each path from the root to a leaf corresponds to the construction of a candidate solution; inner nodes correspond to partial solutions. The move from an inner node to one of its children extends the corresponding partial solution and is also called a solution construction step.
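
For illustration, a candidate solution can be encoded as one unit-variant index per stage; the following minimal CUDA-style sketch shows this encoding (the names CesVariant, countVariants, and the bound MAX_STAGES are ours for illustration, not taken from [3]):

// A CES variant encoded as one unit-variant index per stage; a path from
// the root to a leaf of the search tree corresponds to one such array.
#define MAX_STAGES 16

struct CesVariant {
    int W[MAX_STAGES];            // W[i] in [0, J[i]): chosen variant at stage i
};

// Total number of system variants E = prod_{i=1}^{I} J_i.
__host__ __device__ long long countVariants(const int *J, int I) {
    long long e = 1;
    for (int i = 0; i < I; ++i) e *= J[i];
    return e;
}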

3 Our approach: combining metaheuristics

Our approach combines two metaheuristics: simulated annealing (SA) and ant colony optimization (ACO). The key feature of SA is its ability to escape local optima on the way to a global optimum, which it achieves by occasionally accepting non-improving solutions. We combine SA with another metaheuristic because SA needs a feasible initial solution: random initialization cannot be used for our problem, since the compatibility and processing time constraints must be satisfied.

To find a feasible initial solution, we use the ant colony optimization (ACO) metaheuristic [10], which provides good-quality results in many applications [16]. Note that our search for a feasible solution is a constraint satisfaction problem (CSP) [15] without cost optimization.

Metaheuristic algorithms have several parameters that allow the user to adapt the behaviour of the algorithm. While finding the best parameter values is a non-trivial task [2], there exist established recommendations for choosing these parameters, which we discuss for each part of our hybrid approach—ACO and SA—in Sects. 3.1 and 3.2, respectively.

3.1 Ant colony optimization (ACO)

The ant colony optimization (ACO) metaheuristic is inspired by the behaviour of ant colonies: when moving between the nest and a food source, ants deposit on their path a chemical substance (pheromone), which they use for communication and for finding the shortest path, because, with some probability, ants follow the pheromone left by other ants before them.

In Listing 1, we show the ACO algorithm for the CES problem. An important parameter of the algorithm is M, the number of ants: with a larger colony, fewer iterations are required [17].

Listing 1 The ACO algorithm for the CES problem

In the algorithm, the ants move from the top to the bottom of the search tree. After selecting a node \(r=n_{i,j}\) at tree level i, the ant picks a child node \(s=n_{i+1,k}\); an ant's tour ends at the bottom level I, so that a complete path corresponds to a possible problem solution. The heuristic information increases the probability that each ant takes the “correct way”: the movement from node r to node s is probabilistically biased by the pheromone trail \(\tau _{rs}\) and the heuristic information \(\eta _{rs}\) as follows: \(p_{rs} = {\tau _{rs}^\alpha \cdot \eta _{rs}^\beta } / {\sum _{k\in C_{r}} {(\tau _{rk}^\alpha \cdot \eta _{rk}^\beta )}}\), where \(C_{r}\) is the set of children of r [9, 16, 18]. The parameters \(\alpha \) and \(\beta \) determine the relative impact of the pheromone and the heuristic information on the trail choice.
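
As an illustration of this construction step, the following hedged CUDA sketch selects a child by roulette-wheel sampling over the weights \(\tau _{rk}^\alpha \cdot \eta _{rk}^\beta \); the identifiers (selectChild, tau, eta, u) are ours and not taken from Listing 1, and the uniform random number u would come, e.g., from curand_uniform():

// Probabilistic construction step: choose child s of node r with
// probability proportional to tau^alpha * eta^beta (roulette wheel).
__device__ int selectChild(const float *tau, const float *eta, int numChildren,
                           float alpha, float beta, float u /* uniform in [0,1) */) {
    float weights[32];                       // assumes at most 32 children per node
    float total = 0.0f;
    for (int k = 0; k < numChildren; ++k) {
        weights[k] = powf(tau[k], alpha) * powf(eta[k], beta);
        total += weights[k];
    }
    float threshold = u * total;             // draw a point in [0, total)
    float acc = 0.0f;
    for (int k = 0; k < numChildren; ++k) {
        acc += weights[k];
        if (threshold < acc) return k;       // the bin containing the point
    }
    return numChildren - 1;                  // numerical fallback
}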

When computing the heuristic information, we take into account that a CES with bigger units is usually more expensive, but it also works faster than a CES with smaller units, since it processes bigger batches of products. Therefore, we assume that a unit with a larger main size x that satisfies the processing time constraint for the beginning part of the CES is preferable to a unit with a smaller main size that violates the compatibility constraint.

The pheromone level is modified in two complementary steps: update and evaporation. For the update (line 8), we use the rule \(\tau _{rs} = \tau _{rs} + Q / \sum _{m=1}^M{L_m}\), where \(L_m\) is the tour length of the mth ant and Q is a constant; note that for smaller values of \(L_m\), the value added to the previous pheromone level is larger. Here, \(L_m\) is a fitness value, i.e., a measure of how close a given solution is to the optimum: a fitness function f estimates how close a given solution comes to achieving the goal. For finding a first feasible CES variant, we use the fitness value from [12], adapted to our problem.

In line 12 of Listing 1, we compute the fitness value L[m]. Function NumberS() (lines 14–18) calculates the number of stages of the beginning part of the CES—consisting of devices for stages 1 to i—for which the compatibility constraint is satisfied (lines 16–17). A feasible variant of a CES must satisfy the conditions of joint action for all its processing stages, and it must satisfy the restriction on the duration of its operating period. We add 1 to NumberS() if the processing time constraint is satisfied (line 12). The minimal fitness value is 0 when no constraint is satisfied, and the maximal value is \(Q=I+1\) when all constraints are satisfied and the problem is thereby solved. Using fitness values, we compare ant paths. For example, assume that there are two ant paths, the first with fitness value \(L_1=4\) and the second with fitness value \(L_2=2\). Neither path solves our problem (a solution requires the fitness value 5), but the path with \(L_1=4\) is better than the path with \(L_2=2\), so the first ant increases the pheromone value more strongly than the second.
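
A minimal sketch of this fitness computation could look as follows; it is not the code of Listing 1, and the predicates compatibleUpTo() and processingTimeOk() are our stand-ins for the problem-specific checks \(S(\cdot )=0\) and \(T(\cdot ) \le T_{max}\):

__device__ bool compatibleUpTo(const int *W, int i);   // S == 0 for stages 1..i (assumed)
__device__ bool processingTimeOk(const int *W, int I); // T <= Tmax (assumed)

// Fitness L in [0, I+1]: the number of leading stages whose compatibility
// constraint holds, plus 1 if the processing time constraint is satisfied.
__device__ int fitness(const int *W, int I) {
    int L = 0;
    for (int i = 1; i <= I; ++i) {
        if (compatibleUpTo(W, i)) L = i;     // deepest feasible beginning part
        else break;
    }
    if (processingTimeOk(W, I)) L += 1;      // bonus for the time constraint
    return L;                                // L == I + 1 solves the problem
}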

Evaporation (line 9 of Listing 1) is applied at a constant rate after each iteration, in order to decrease the concentration of pheromone. Thereby, our algorithm can forget poor choices made previously [17]; this ensures that old pheromone does not influence future decisions too strongly. We use the trail persistence parameter \(\rho \in [0,1]\) in the formula \(\tau _{rs} = \tau _{rs} \cdot \rho \).
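
Both pheromone steps are simple to express in code; the following is a hedged sketch under our naming (deposit, evaporate), written for the multiple-colonies setting of Sect. 4, where each thread owns a private trail array and therefore needs no atomics:

// Update: tau_rs += Q / sum_m L_m for every edge on an ant path.
__device__ void deposit(float *tau, const int *path, int pathLen,
                        float Q, float sumL) {
    for (int i = 0; i < pathLen; ++i)
        tau[path[i]] += Q / sumL;
}

// Evaporation: tau_rs *= rho on all edges, once per iteration.
__device__ void evaporate(float *tau, int numEdges, float rho) {
    for (int e = 0; e < numEdges; ++e)
        tau[e] *= rho;                        // trail persistence rho in [0,1]
}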

3.2 Simulated annealing (SA)

Metaphorically speaking, SA is similar to dropping some bouncing balls over a landscape, and as the balls lose energy, they settle down to some local minima. If the balls are allowed to bounce long enough and lose energy slowly enough, some of them will eventually fall into the globally lowest location, and hence the global minimum will be reached. A parameter called “temperature” (in analogy to the physical annealing process) controls the search behaviour: SA probabilistically accepts a candidate solution of worse quality than the current solution in order to avoid local minima.

The idea of SA is to perform a wide exploration of the solution space at the beginning and then to restrict the search gradually, thus converging to the best solution. Our SA implementation in Listing 2 consists of two loops. Here, W[i] specifies the device variant at stage i; W is initialized with an initial solution Winit obtained by ACO, and the initial temperature Tinit is the difference between the costs of the most expensive and the cheapest CES variants [1].

Listing 2 The SA algorithm

The outer loop (line 3) decreases the temperature, thereby reducing the probability of accepting non-improving solutions in the inner loop (line 4). The inner loop searches for a neighbouring solution (line 5): our procedure Perturb() (Listing 3) generates a new candidate solution Wcand by taking, for a random stage of the feasible solution, a random unit from the equipment set available at this stage. We avoid local optima by generating neighbours randomly and by accepting a solution that worsens the objective function with a probability that depends on the change \(\Delta \mathcal {E}\) of the objective function and the temperature parameter t; this probability p decreases over time, as t decreases.

Listing 3 The procedure Perturb()

Our objective function f is the same as in Sect. 2. At each iteration, a neighbouring solution is computed by the procedure Perturb() (line 5): to generate a new solution, we randomly pick a CES stage i and select the unit number j at random among the \(J_i\) variants available at this stage. We compute the transition probability p as \(p=\exp (-\Delta \mathcal {E} /(k_{\mathrm {B}}\cdot t ))\), where \(k_{\mathrm {B}}\) is Boltzmann's constant and \(\Delta \mathcal {E}\) is the change of the energy level [20]. We use \(k_{\mathrm {B}} = 1\) and \(\gamma = 1\) [20], such that \(\texttt {p} = \texttt {exp}~ (-\texttt {deltaCost}/\texttt {t})\) (line 13).
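
A minimal sketch of one SA step under these assumptions (\(k_{\mathrm {B}}=1\)): perturb() re-draws one random stage's unit, and accept() implements the Metropolis test with \(p = \exp (-\texttt {deltaCost}/t)\). The names cost, perturb, and accept are ours, and the per-thread random stream comes from CURAND:

#include <curand_kernel.h>

__device__ float cost(const int *W, int I);          // objective function f (assumed)

// Neighbour generation: copy W and re-draw the unit of one random stage.
__device__ void perturb(const int *W, int *Wcand, int I, const int *J,
                        curandState *rng) {
    for (int s = 0; s < I; ++s) Wcand[s] = W[s];
    int i = curand(rng) % I;                         // random stage
    Wcand[i] = curand(rng) % J[i];                   // random unit variant at stage i
}

// Metropolis acceptance: always take improving moves; take worsening moves
// with probability exp(-deltaCost / t).
__device__ bool accept(float deltaCost, float t, curandState *rng) {
    if (deltaCost <= 0.0f) return true;
    return curand_uniform(rng) < expf(-deltaCost / t);
}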

Our implementation of SA is based on generating a sequence of homogeneous Markov chains of finite length \(L_{max}\), which depends on the problem size [1]. At a given temperature t, the iterations are repeated \(L_{max}\) times, where \(L_{max}\) is the size of the equipment set, i.e., \(L_{max}=\sum _{i=1}^IJ_i\), with \(J_i,i=\overline{1,I}\), being the number of equipment unit variants for stage i. We perturb the feasible solution by selecting at each iteration a random unit (line 3) in one random stage (line 2). At the end of each iteration, the temperature t is decreased using a cooling procedure defined by an initial temperature Tinit, a rule for reducing the temperature, and a final temperature Tfinal (line 3); in our case, we use the price of the cheapest unit for Tfinal. An effective cooling rule is essential for reducing the run time of the algorithm: in line 16, we apply the fast cooling rule \(t= \sigma \cdot t\) [18], where we follow [1] in choosing \(0.8 \le \sigma \le 0.99\).
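
Putting the pieces together, the following hedged sketch shows the whole SA loop (our structure, not the paper's Listing 2), with the chain length \(L_{max}=\sum _i J_i\) and the fast cooling rule \(t = \sigma \cdot t\); cost(), perturb(), and accept() are the assumed functions from the previous sketch:

#include <curand_kernel.h>

#define MAX_STAGES 16

__device__ float cost(const int *W, int I);
__device__ void perturb(const int *W, int *Wcand, int I, const int *J,
                        curandState *rng);
__device__ bool accept(float deltaCost, float t, curandState *rng);

__device__ void simulatedAnnealing(int *W, int I, const int *J,
                                   float Tinit, float Tfinal, float sigma,
                                   curandState *rng) {
    int Lmax = 0;
    for (int i = 0; i < I; ++i) Lmax += J[i];        // chain length = equipment set size
    int Wcand[MAX_STAGES];
    float curCost = cost(W, I);
    for (float t = Tinit; t > Tfinal; t *= sigma) {  // fast cooling, 0.8 <= sigma <= 0.99
        for (int l = 0; l < Lmax; ++l) {             // homogeneous Markov chain at t
            perturb(W, Wcand, I, J, rng);
            float candCost = cost(Wcand, I);
            if (accept(candCost - curCost, t, rng)) {
                for (int s = 0; s < I; ++s) W[s] = Wcand[s];
                curCost = candCost;
            }
        }
    }
}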

4 Parallel implementation on GPU

In Fig. 2, we show the parallelization scheme of the hybrid \((\hbox {ACO}+\hbox {SA})\) approach on a system comprising a CPU and a GPU.

Fig. 2 GPU parallelization scheme

The implementation consists of sequential host code for the CPU that invokes multiple parallel threads on the device (GPU); these threads execute the kernel code.

The implementation proceeds in five steps as shown in Fig. 2:

1. The CPU receives the input data (production horizon \(T_{max}\), the numbers of available equipment variants \(J_i\), the number of stages I, etc.), initializes the parameters for ACO and SA, and starts the kernel function for ACO on the GPU.

2. In the ACO kernel, the GPU searches for the first feasible solution—the initial CES variant—using the Multiple Ant Colonies approach [7]: parallel threads solve the problem independently, and as soon as one of the threads finds a solution, all threads terminate. The probability of finding a solution increases with the number of threads, which reduces the search time.

3. The CPU distributes the obtained feasible solution among the threads as a seed for SA; the GPU starts the SA kernel function.

4. The GPU searches in each thread for the optimal solution, using the SA kernel with the seed from ACO. We do not try to reduce the time of one iteration, but rather increase the number of iterations executed simultaneously: since each thread executes an independent instance of SA, this increases the chance that the algorithm converges to the global optimum. Using a larger number of threads thus increases the probability of eventually finding a nearly optimal solution, but it does not reduce the run time of the algorithm.

5. The GPU threads send their SA solutions to the CPU, which chooses the best of them—this is the final problem solution.

Listing 4 The host code

4.1 Host code

In the host code shown in Listing 4, the input data are loaded from a file, and the number of threads is provided as a program launch parameter (line 2). The host then starts the kernel of the ACO algorithm on the GPU.

The functions ReadInputData() and PrepOperationalData() perform all necessary memory allocations and variable initializations, including the pseudo-random number generator for each thread. The flag isFound is initialized on the host with 0 (line 12) and then used on the GPU for controlling the threads: while isFound == 0, a thread continues its work; once the first feasible solution is found and isFound != 0, all threads terminate. While no solution has been found, the host repeats the search (line 14). The host takes the first feasible solution as an initial guess for each thread (line 24). After sending the initial guess for each thread to the device (line 26), the host starts the kernel function SA() (line 28) to find the optimal solution using the SA algorithm. When all threads have finished their work, the host receives the results (line 29).
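
The control flow of the host code can be condensed into the following hedged sketch (our names and signatures; the paper's Listing 4 differs in detail): the host re-launches the ACO kernel until the isFound flag is raised, copies the feasible variant back as the common SA seed, and launches the SA kernel:

#include <cuda_runtime.h>
#include <curand_kernel.h>

#define MAX_STAGES 16

__global__ void ACO(float *tauAll, int numEdges, int I, int maxIterNumber,
                    int *isFound, int *solution, curandState *rng);
__global__ void SA(const int *seed, int *results, curandState *rng);

void runHybrid(int I, int numEdges, int maxIterNumber,
               int numBlocks, int threadsPerBlock,
               float *d_tau, int *d_isFound, int *d_solution,
               int *d_seed, int *d_results, curandState *d_rng) {
    int isFound = 0;
    cudaMemcpy(d_isFound, &isFound, sizeof(int), cudaMemcpyHostToDevice);
    while (isFound == 0) {                           // repeat the search (cf. line 14)
        ACO<<<numBlocks, threadsPerBlock>>>(d_tau, numEdges, I, maxIterNumber,
                                            d_isFound, d_solution, d_rng);
        cudaDeviceSynchronize();
        cudaMemcpy(&isFound, d_isFound, sizeof(int), cudaMemcpyDeviceToHost);
    }
    int h_seed[MAX_STAGES];                          // first feasible solution
    cudaMemcpy(h_seed, d_solution, I * sizeof(int),
               cudaMemcpyDeviceToHost);              // (cf. line 24)
    cudaMemcpy(d_seed, h_seed, I * sizeof(int),
               cudaMemcpyHostToDevice);              // common SA seed (cf. line 26)
    SA<<<numBlocks, threadsPerBlock>>>(d_seed, d_results, d_rng);
    cudaDeviceSynchronize();                         // host then picks the best result
}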

4.2 Kernel code for ACO

Our parallel implementation of ACO is shown in Listing 5, where each thread simulates the work of one ant colony.

Listing 5 The ACO kernel

Initially, all edges of the search tree are assigned random pheromone values in [0, 1] (lines 4–5). We control the threads by the global flag isFound and the local iteration counter iterCounter: a thread changes the flag using atomicAdd() when it has found a feasible solution (line 11). If the ants in a thread cannot find a solution after maxIterNumber iterations (which is possible for stochastic algorithms), the thread terminates.
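
The thread-control pattern can be sketched as follows (hedged: identifiers and signatures are ours, and colonyIteration() stands for one full colony iteration of Listing 1, which we omit):

#include <curand_kernel.h>

#define MAX_STAGES 16

// One colony iteration: construct M ant paths and update the thread-local
// pheromone trails; returns true if a feasible variant was found (assumed).
__device__ bool colonyIteration(float *tau, int *path, curandState *rng);

__global__ void ACO(float *tauAll, int numEdges, int I, int maxIterNumber,
                    int *isFound, int *solution, curandState *rng) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    float *tau = tauAll + tid * numEdges;            // each thread owns its trails
    int path[MAX_STAGES];
    for (int iter = 0; iter < maxIterNumber; ++iter) {
        if (*isFound != 0) return;                   // another colony already succeeded
        if (colonyIteration(tau, path, &rng[tid])) {
            if (atomicAdd(isFound, 1) == 0)          // first finder publishes its variant
                for (int i = 0; i < I; ++i) solution[i] = path[i];
            return;
        }
    }
    // A colony may fail within maxIterNumber iterations; the thread then terminates.
}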

5 Evaluation

We conduct experiments on the following system: a CPU (Intel Xeon, 4 cores with hyper-threading, 3.7 GHz, 16 GB RAM) and a GPU (NVIDIA Tesla K20c with 2496 CUDA cores), with CUDA version 8.0. Since both SA and ACO are probabilistic, their results differ from run to run; therefore, we run each instance 100 times and take the average value. Our test case is the design of a CES consisting of 16 processing stages, with a total of \(2^{16}\) to \(12^{16}\) CES variants. We use the B&B algorithm from [4] to find the globally optimal solution, and we compare the results of our hybrid metaheuristic approach \((\hbox {ACO}+\hbox {SA})\) with the solution obtained by B&B.

In Fig. 3, we show the run time of the \((\hbox {ACO}+\hbox {SA})\) parallel program (averaged over 100 runs) depending on the number of threads, for the CES example with \(10^{16}\) variants. We observe that, while on 100 threads ACO takes 91% of the total run time, its share decreases to about 10% on more than 2000 threads: with more threads, SA improves the quality of the solution but not the speed, whereas ACO finds a solution faster with a higher probability.

Fig. 3 Run time of \((\hbox {ACO}+\hbox {SA})\) depending on the number of threads

Fig. 4 Per cent deviation of the found solution from the global optimum

In Fig. 4, we show the deviation of the solution obtained by our hybrid algorithm, calculated as \({|\hbox {observed} - \hbox {expected}|}/{\hbox {expected}} \cdot 100\%\), where the expected value is the optimal value obtained by B&B, and the observed value is the average over 100 runs of our algorithm. We observe that on 100 threads the deviation is about 0.38%, and it decreases to almost 0% on 2000 threads, i.e., we obtain a near-optimal solution.

In Fig. 5, we show the deviation from the optimum for different implementations of our hybrid \((\hbox {ACO}+\hbox {SA})\) algorithm: sequential on the CPU and parallel on the GPU. The parallel program was run with up to 2500 threads and yields a nearly optimal solution. The deviation achieved by the sequential implementation, in contrast, is \(<1\%\) for small problem sizes but grows to 14% for bigger sizes: the sequential implementation makes only a single run of SA, and the probability of finding a good solution in a single run is higher for small sizes.

Fig. 5 Deviation for different problem sizes

Fig. 6 Run time depending on problem size

In Fig. 6, we show the run time of our hybrid approach versus B&B depending on the problem size (the vertical axis has a logarithmic scale). The run time of B&B increases exponentially: for size \(12^{16}\), it becomes prohibitively long at about 134 h. The run time of the sequential hybrid algorithm is less than 0.1 s for the smallest problem and increases to about 580 s for the maximal problem size. The run time of the parallel implementation increases smoothly from 5 to 227 s. For larger problem sizes (\(10^{16}\) to \(12^{16}\)), the parallel implementation is faster than the sequential version by about 2.6–2.8 times.

The speedup of 2.7 might seem small, but the quality of the solutions obtained by the parallel implementation is much higher: its deviation from the global optimum is \(<0.01\%\), against \(>10\%\) for the sequential version. This high quality is achieved by the multiple runs of SA: with more threads, the probability is higher that one of the threads finds a nearly optimal solution. The run time and the quality of solutions are influenced by the parameters of the metaheuristic algorithms, which we chose empirically after several experiments: \(\alpha =0.4\) and \(\beta =0.6\), colony size \(M=100\), \(\rho =0.9\), and the cooling rule constant \(\sigma = 0.9\).

6 Conclusion

We have developed a novel hybrid \((\hbox {ACO}+\hbox {SA})\) metaheuristic approach to the optimization problem of multiproduct batch plant design, parallelized it on a GPU, and systematically compared it to the deterministic B&B algorithm.

We show experimentally that increasing the number of threads accelerates the search for a feasible solution by ACO and also increases the quality of the solutions obtained by SA. The experiments confirm that the solutions obtained by our parallel hybrid approach are very near to the globally optimal values obtained by B&B, while our metaheuristic approach finds its solution much faster: in a few tens or hundreds of seconds instead of tens or hundreds of hours.