1 Introduction

Nature-inspired and evolutionary models have been exploited for solving a variety of hard-to-solve computational problems. One of the central benefits of these models is that they facilitate parallel processing, i.e. the simultaneous use of more than one processor core to execute different parts of the same program. The key point in designing parallel processing for evolutionary computation is twofold. On the one hand, a program needs to be divided so that separate cores can work together without interfering with each other. On the other hand, the cores should communicate the best results they produce so that each core can use the experience of the other cores in evolving solutions.

With respect to this twofold consideration, this paper presents an evolutionary metaheuristic, in which different semi-independent threads, each running on a separate CPU core, work separately and interact occasionally with one another to inform each other about the best overall solution obtained. It is this best overall solution that becomes the focus of the extra search and its vicinity is searched by all the threads. According to the classification made by [24], the proposed parallel strategy can be considered as a co-operative multi-search.

A thread operates in three separate layers including (i) a heuristic construction module to generate initial solutions, (ii) a genetic algorithm module to combine high quality solutions, and (iii) an enhancing module to further improve the solutions. The enhancing module not only improves the best solution obtained by the corresponding thread, but enhances the best overall solution obtained by all other threads as well. The presented procedure, called the Parallel 3-layer Hybrid (P3H), considers each thread as the combination of six synergetic components, namely (i) an initial solution construction method, (ii) a crossover operation, (iii) a mutation operation, (iv) a restrictedTabu component, (v) a large neighborhood scheme, and (vi) a perturb mechanism.

To explore the search space independently and effectively, a thread uses its components in a randomized manner. The benefit of such randomization is twofold. First, it circumvents the possible search redundancy which could occur because all the threads are aimed at improving the same high quality solution. Second, even for a single thread, the randomization prevents it from doing the same set of operations if a solution has been encountered more than once.

The key contribution of this study is to show the effectiveness of this metaheuristic, through the synergetic integration of these six modules, on three famous and hard-to-solve optimization problems, namely the job shop scheduling, the permutation flowshop scheduling, and the quadratic assignment problems. The rest of this paper is organized as follows. Section 2 discusses the related work for each of the problems. Section 3 presents the formulation of these three problems. The stepwise description of the P3H is outlined in Section 4, and the results of computational experiments are presented in Section 5. Section 5 also discusses the setting of the P3H parameters for each of the three problems. Concluding remarks and some suggestions for future work are given in Section 6.

2 Related work

The three problems to which the P3H is applied are the Permutation Flowshop Scheduling Problem (PFSP), the Job Shop Scheduling Problem (JSP), and the Quadratic Assignment Problem (QAP). While these problem formulations are explained in detail in the next section, the related work is outlined here. This section is divided into two subsections. The first describes work related to the components of the P3H, and the second presents work related to parallel methods for the PFSP, the JSP, and the QAP.

2.1 Related work to the components of the presented method

The most effective strategies for hard-to-solve combinatorial optimization problems are categorized into the three groups of (i) constructive methods, (ii) local search techniques, and (iii) population-based methods. Constructive methods build a solution by sequentially deciding the values of solution components, i.e. decision variables. Imitating the foraging behavior of ant colonies is the key idea behind a popular constructive method called Ant Colony Optimization (ACO) [27]. The Rollout algorithm [15] is another popular constructive method capable of producing high quality solutions.

Unlike constructive methods, local searches take a complete solution to a problem and, by checking its immediate neighbors, which are similar solutions with one or two minor differences, aim to find an improved solution [71]. Getting stuck in local optima is the main problem with local search techniques; such a situation is depicted in Fig. 1.
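To make the neighborhood idea concrete, the following Python sketch (an illustrative toy, not code from this paper; the cost function is a made-up example) runs a swap-neighborhood local search that accepts improving neighbors until none remains, i.e. until a local optimum is reached:

```python
import itertools

def local_search(perm, cost):
    """Local search over the swap neighborhood: keep replacing the
    current permutation with an improving swap-neighbor until no
    neighbor improves the cost (a local optimum)."""
    current, current_cost = list(perm), cost(perm)
    improved = True
    while improved:
        improved = False
        for i, j in itertools.combinations(range(len(current)), 2):
            neighbor = current[:]
            neighbor[i], neighbor[j] = neighbor[j], neighbor[i]
            c = cost(neighbor)
            if c < current_cost:
                current, current_cost = neighbor, c
                improved = True
    return current, current_cost

# Toy cost: total displacement of each element from its sorted position.
cost = lambda p: sum(abs(v - i) for i, v in enumerate(p))
print(local_search([3, 0, 2, 1], cost))  # → ([0, 1, 2, 3], 0)
```

On this toy instance the local optimum happens to be the global one; Fig. 1 illustrates the general case where it need not be.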

Fig. 1 A situation in which a local optimal solution has been surrounded by many other local optimal solutions

Local optimality has been addressed in different ways. Whereas in Iterated Local Search (ILS) the starting solution of the local search is derived by perturbing the previous local optimum found [44], in Variable Neighborhood Search (VNS) a set of neighborhoods of different orders is employed [35]. Tabu Search (TS), on the other hand, explicitly exploits short-term and long-term memory to guide the search [33], with short-term memory being used to keep track of recently visited solutions and long-term memory to monitor the search progress.

Genetic Algorithms [36], Particle Swarm Optimization [22], and scatter search [33] are prime examples of population-based methods. Population-based methods are composed of five main components, namely (i) an encoding/decoding scheme that maps every solution (phenotype) to a chromosome (genotype), (ii) a fitness function that assigns a goodness value to each individual, (iii) a parent selection strategy that determines which individuals are nominated as parents to produce offspring, (iv) a survival selection strategy that defines a rule for deciding which individuals survive to the next generation, and (v) reproduction operators, which specify the way two or more encodings are combined to produce an offspring encoding.

Towards making population-based methods more effective, the “go with the winners” strategy [5], Population-Based Incremental Learning (PBIL) [14], and the path relinking and scatter search concepts [34] are instrumental. In effect, path relinking has been successfully applied to the QAP [2] and the PFSP [53].

2.2 Parallel methods for the PFSP, JSP, and QAP

Since the early development of parallel processing technology in mainframes, many researchers have concentrated on solving hard-to-solve problems by integrating parallelization techniques with metaheuristics. A recent overview of parallel metaheuristics is presented in [23], where the promising performance of parallel cooperative strategies is emphasized. In cooperative strategies, semi-independent procedures occasionally synchronize by sharing some information during the search progress. These methods are typically referred to as cooperative multi-search in the literature, and their success in tackling hard, NP-complete optimization problems is further stressed in [4, 65, 67], and [24]. More recently, new techniques for parallelization using Graphics Processing Units (GPUs) have been proposed for well-known metaheuristics such as ACO and GA [20, 50].

Single Program Multiple Data (SPMD) and threads models can be seen as two commonly used parallel programming models employed in metaheuristics. SPMD is a high-level parallel programming model in which all independent tasks run their own copy of the same program simultaneously, with different input data or initial points. In the case of metaheuristics, this initial starting point could simply be a different seed value for the pseudorandom number generators. The threads model, on the other hand, is a type of shared memory programming, in which a single process can have several concurrent execution paths. Independent threads can also communicate with one another through a global (shared) memory. This is in line with parallel cooperative strategies, as the knowledge of the search space can be shared and utilized by all threads.

It should be noted that multithreading is not necessarily equivalent to parallel computing. In this study, however, a threads model is used, and implemented using OpenMP API [21]. Provided that the number of threads is less than or equal to the number of CPU cores, OpenMP API typically ensures that each thread is run on a separate CPU core [21].
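The cooperative shared-memory pattern can be sketched as follows in Python, whose threads share memory but, under CPython's global interpreter lock, do not run truly in parallel; the "search" below is a deliberately trivial toy (minimizing x² over random integers), used only to show how semi-independent threads, each with its own seed, synchronize through a lock-protected global best:

```python
import random
import threading

# Shared (global) memory: the best solution found so far by any thread,
# guarded by a lock because several threads read and update it concurrently.
best = {"cost": float("inf"), "solution": None}
lock = threading.Lock()

def search_thread(seed, iterations=2000):
    """One semi-independent search thread (toy search: minimize x**2)."""
    rng = random.Random(seed)            # per-thread seed, as in SPMD models
    local_sol, local_cost = None, float("inf")
    for _ in range(iterations):
        x = rng.randint(-100, 100)
        if x * x < local_cost:
            local_sol, local_cost = x, x * x
        # Occasional cooperation: synchronize with the shared global best.
        with lock:
            if local_cost < best["cost"]:
                best["cost"], best["solution"] = local_cost, local_sol
            elif best["cost"] < local_cost:
                local_sol, local_cost = best["solution"], best["cost"]

threads = [threading.Thread(target=search_thread, args=(s,)) for s in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(best)  # {'cost': 0, 'solution': 0} with overwhelming probability
```

An OpenMP implementation replaces the explicit thread objects with a parallel region and the lock with a critical section, but the cooperation logic is the same.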

Since the early days of parallel processing technology, researchers have focused on solving the PFSP, JSP, and QAP, as three highly applicable and challenging problems, through multithreaded and multi-core-based procedures. In the following subsections, these parallel procedures are surveyed for each of these problems separately.

2.2.1 Parallel methods for the PFSP

As a source of parallelism, island models in evolutionary searches can exchange individuals throughout the entire run of the algorithm, with this exchange of individuals generally being termed “migration” [25, 73, 74]. Among several island model GAs proposed for the PFSP, we can mention the one presented in [59], where the authors conducted experiments on randomly generated instances with 40–100 jobs and 4–10 machines. Another island model, presented in [17], addresses the FSP with the total completion time criterion. In this island model, the crossover operator is performed on individuals from different islands. The authors conducted experiments on Taillard’s benchmarks [63].

The parallel simulated annealing of [75] is another related work. The authors suggest an island model parallel strategy in which cooperation occurs when the global best solution is updated. The authors compared the independent and cooperative variants of the proposed procedure with the NEH heuristic of [46] and concluded that the cooperative variant with four processors yields better results.

A parallel Tabu search has also been presented by the same authors [16], where the search threads cooperate by broadcasting the global best solution, with a specific thread responsible for managing the exchange of global best solutions. The authors experimented on a selection of Taillard’s benchmark instances [63] as well as some randomly generated instances.

Among the more recent parallel approaches are those of [70] and [52]. In [70], a cooperative island model has been proposed wherein, similar to our approach, the same algorithm is run on different islands and occasional cooperation occurs at different stages of the algorithm. In the parallel hybrid proposed by [52], different allocations of Memetic Algorithms and the Iterated Greedy (IG) procedure [56] to multiple threads have been studied.

2.2.2 Parallel methods for the JSP

Among the earliest parallel strategies, we can mention the parallel tabu search presented in [64], where the author presents a tabu search method and describes why it is more efficient than the shifting bottleneck procedure. In the same paper, a parallel variant of the tabu search has also been presented that divides a problem into k sub-problems, each containing a subset of jobs. Each of these sub-problems is then solved independently and, in the final stage, the sub-problems are aggregated to form a solution to the original problem.

Aiex et al. [3] present a hybrid of GRASP and path-relinking which operates on a pool of elite solutions. In effect, GRASP is used to generate a pool of elite solutions, and path-relinking is applied to (i) a solution produced by GRASP and (ii) an elite solution chosen from the pool. The path-relinking result is used to update the pool of elite solutions. Two parallel variants have also been proposed in [3], namely collaborative and non-collaborative. While the non-collaborative version is an SPMD approach in which each thread executes a copy of the algorithm, in the collaborative version the pool of elite solutions is shared among threads. The authors also describe why their collaborative scheme achieves a better speed-up factor than their non-collaborative scheme.

Another related work is that of [58], in which a parallel Variable Neighborhood Search (VNS) is proposed for the JSP. The authors’ VNS consists of two main components, namely the shake and LocalSearch procedures. Whereas the shake procedure has the role of perturbing a given solution, the LocalSearch procedure improves the given solution based on SWAP or INSERTION neighborhoods. Therefore, the VNS of [58] consists of three main steps in each iteration: (i) constructing an initial solution, x, (ii) performing the shake procedure on x and storing the result as x′, and (iii) performing the LocalSearch on x′.

Sevkli and Aydin [58] compared four different parallelizations of their proposed VNS algorithm. In the first scheme, a copy of the VNS is run by all processors, starting from a single initial solution. After the completion of the VNS by all processors, the best solution found among all processors acts as the initial solution for the next iteration. While the first scheme waits for all threads to complete their VNS before starting the next generation, in the second scheme the parallelization strategy is asynchronous, and as soon as a processor finishes its VNS run, its next iteration starts with the incumbent overall best solution at that time. The two other parallel schemes proposed by the authors are decentralized in the sense that the search threads communicate through a network of processors and no central synchronization occurs. The two network structures proposed are a unidirectional ring and a mesh. The authors argued that the ring topology performs best among all the schemes.

2.2.3 Parallel methods for the QAP

One of the earliest and most famous procedures for the QAP is the Robust Tabu Search (RTS) developed in [62]. In that paper, the author presents two parallelization schemes for the proposed RTS. In the first scheme, the process of evaluating all potential neighbor solutions is divided among different processors for the purpose of reducing the computation time; after each processor evaluates its portion of the neighborhood, all threads synchronize at a single point to identify the best possible neighbor. The other proposed parallel strategy of [62] is the SPMD approach of running RTS instances independently from different initial solutions.

Another related parallel procedure to our proposed algorithm is the Cooperative Parallel Tabu Search (CPTS) of [38], which incorporates the RTS of [62]. The CPTS is essentially an SPMD parallel algorithm in which a copy of RTS is run in each thread, and each thread stops and synchronizes before and after running the RTS procedure. In particular, the cooperation between processors occurs by maintaining a small set of elite solutions, named reference set. Basically, the number of solutions in the reference set is equal to the number of processors and the reference set is shared among all processors.

Among the other parallel strategies for the QAP, we can mention the parallel hybrid of ant-colony and tabu search [66], and the island model parallel genetic algorithm of [69]. In both methods, a master-slave paradigm is used and the global information regarding best solutions in islands and/or pheromone trail matrix is communicated among the processors.

The other related work is the single instruction multiple data tabu search (SIMD-TS) of [79], in which a tabu search procedure is run on each thread and certain diversification and intensification actions are probabilistically performed in each thread. It is worth noting that the SIMD-TS has specifically been designed to run on GPUs instead of ordinary CPUs. An extensive review of recent exact and heuristic methods for the QAP can be found in [29].

3 Problem formulation

The PFSP, JSP, and QAP are described in turn as follows. Since the PFSP is a special case of the Flowshop Scheduling Problem (FSP), the FSP needs to be discussed first.

The FSP is a subclass of scheduling problems in which n jobs have to be processed on m machines, with the goal of finding an optimal processing sequence of jobs on machines. The optimality criterion is mainly the completion time of the last operation on the last machine (the makespan). In the FSP, each job has to be processed on the same sequence of machines. For instance, if job 1 has to be processed on machines 4, 2, 3, and 1, one after another, then all other jobs follow the same order. The PFSP is a special case of the FSP in the sense that all machines have to process all the jobs in the same order. In the PFSP, assuming that the jobs are numbered 1,2,…,n, the goal is to find an optimal permutation of jobs π1,π2,…,πn so that the completion time of the last job on the last machine is minimized. The completion time of job πi on machine j, C(πi,j), can be calculated as follows.

$$ C(\pi_{i},j)=\max{\{C(\pi_{i-1},j),C(\pi_{i},j-1)\}}+T(\pi_{i},j) $$
(1)

where C(π0,j) = C(πi,0) = 0 and T(i,j) is the processing time of job i on machine j.
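The recurrence in Eq. (1) can be evaluated in O(nm) time with a single rolling array, since each entry depends only on the previous job's completion on the same machine and the current job's completion on the previous machine. The following Python sketch is an illustration of this computation (the two-job instance is made up for demonstration, not taken from the paper):

```python
def makespan(perm, T):
    """Completion time of the last job on the last machine, per Eq. (1).
    T[i][j] is the processing time of job i on machine j; perm is the
    job permutation pi_1, ..., pi_n (0-based job indices)."""
    m = len(T[0])
    C = [0] * m                  # C[j]: completion of the previous job on machine j
    for job in perm:
        prev = 0                 # completion of this job on machine j-1
        for j in range(m):
            C[j] = max(C[j], prev) + T[job][j]
            prev = C[j]
    return C[-1]

# Toy instance: T[job][machine]
T = [[3, 2],   # job 0
     [2, 4]]   # job 1
print(makespan([0, 1], T))  # → 9
print(makespan([1, 0], T))  # → 8, so the job order matters
```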

The PFSP has diverse applications in manufacturing and has increasingly attracted the attention of researchers to assess new algorithmic ideas. The PFSP is NP-hard [54, 55] and even some instances with a moderate number of jobs and machines have not been solved to optimality.

The second problem is the JSP. In the JSP, n jobs have to be processed on m machines and, unlike in the FSP, each job may have a different processing order on the machines. However, as in the FSP, each machine can process only one job at a time and each job can be processed on only one machine at a time. The goal is to find a schedule which minimizes the time required to process all jobs on all machines, i.e. the makespan. The JSP is one of the hardest combinatorial optimization problems [32]. A standard, well-known mathematical formulation for the JSP is the disjunctive graph formulation, which can be found in [1].

The QAP is in the class of Facility-Location problems, with diverse applications in different areas such as factory layout design as well as the problem of placing electronic components in a circuit or microchip. In the QAP, we are given a set of locations and a set of facilities, both having the same size, n. In addition, two n × n matrices, D and F are given as input, with dkl indicating pair-wise distance between locations k and l, and fij specifying pair-wise flow between facilities i and j. The objective is to find a one-to-one mapping between facilities and locations which minimizes the cost of flow, C, calculated as a summation of pair-wise flow between facilities multiplied by their distance.

A solution to the QAP can simply be represented as a permutation of facilities. For example, the permutation π = (4,3,1,2) represents the solution in which facilities 4, 3, 1, and 2 are placed in locations 1, 2, 3, and 4, respectively. Belonging to the class of NP-hard problems, the QAP is computationally demanding even with respect to finding a solution guaranteed to be within a given distance of the optimal solution. With π representing a solution and Π denoting the set of all possible permutations, the QAP objective function is formulated as follows.

$$ \min_{\pi \in {\varPi}}{C(\pi)}=\sum\limits_{i=1}^{n}\sum\limits_{j=1}^{n}{f_{ij}d_{\pi(i)\pi(j)}} $$
(2)
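Under the convention that π(i) is the location assigned to facility i, the objective of Eq. (2) can be evaluated directly as a double sum. The following Python sketch illustrates this (the 3 × 3 flow and distance matrices are a made-up toy instance, not from the paper):

```python
def qap_cost(perm, F, D):
    """QAP objective of Eq. (2): sum over all i, j of
    F[i][j] * D[perm[i]][perm[j]], where perm[i] is the (0-based)
    location assigned to facility i."""
    n = len(perm)
    return sum(F[i][j] * D[perm[i]][perm[j]]
               for i in range(n) for j in range(n))

# Toy symmetric instance with 3 facilities/locations.
F = [[0, 5, 2],
     [5, 0, 3],
     [2, 3, 0]]
D = [[0, 1, 4],
     [1, 0, 2],
     [4, 2, 0]]
print(qap_cost([0, 1, 2], F, D))  # identity assignment → 38
print(qap_cost([1, 0, 2], F, D))  # swapping facilities 0 and 1 → 42
```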

It is worth noting that the P3H can be applied as a general-purpose metaheuristic, in which any fitness or cost function and any set of constraints can be adopted. Furthermore, any constraint in the given problem can be seen as defining the feasible search space. In this paper, we identify a good solution as one having a lower cost, i.e. a minimization problem, and the constraints are satisfied by considering only valid permutations as feasible solutions.

4 The P3H

The P3H employs multi-threading; in each thread, a construction technique improves the quality of genomes in a genetic algorithm, with an effective exploration/exploitation balance being achieved through unrestricted and egalitarian parent selection. Restricted and elitist offspring selection also contributes to this balance. The threads are not fully independent in the sense that in each thread, the employed local search improves the best overall solution obtained across all threads. The three interacting layers of the P3H are depicted in Fig. 2, and the detailed stepwise description of the P3H is shown in Fig. 3.

Fig. 2 Three layers of the P3H procedure and their interactions

Fig. 3 The pseudocode of the parallel three-layer hybrid (P3H)

It should be noted that the P3H can easily be reproduced for similar optimization problems by (i) implementing the general, problem-independent modules, namely the heuristic construction, genetic algorithm, and enhancing modules, and (ii) incorporating six problem-specific modules. While the general modules are described next, the problem-specific components, outlined in Table 1, are explained in detail in the following subsections.

Table 1 The selection of P3H problem-specific modules for the PFSP, JSP, and QAP

As seen in Fig. 3, line 2 starts the threads, with the block of code instantiated for each thread being shown in lines 3 through 50. It is worth noting that since the P3H is an SPMD parallel approach, each thread runs an identical copy of the same routine.

Each thread operates in three layers. In the first layer, a number of high-quality solutions are constructed, in the sense that among all of the constructed solutions, those with lower quality are discarded and those with higher quality are kept in a list called eliteHeap. The highest-quality part of these kept solutions is used in the second layer, and the rest can be used in the third layer.

In the second layer, an initial population of solutions is formed by selecting the top solutions from the eliteHeap, and this population is evolved through the genetic algorithm. For the purpose of evolving the solutions, the algorithm works as follows. First, from the current population, whose size is denoted by populationSize, populationSize/2 pairs of parents are selected at random. Then, a problem-specific crossover operator is applied to each pair of parent genomes. This is followed by a mutation operator, so that an offspring population with the same size as the current population is generated. Finally, from the combined population of parent and offspring genomes, only the half with the highest quality enters the next generation and the other half is discarded.
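The second-layer generation scheme can be sketched as follows in Python. The crossover, mutation, and cost functions below are simplistic placeholders for the problem-specific operators (the paper's actual operators are described in Sections 4.1–4.3); the sketch only illustrates the pairing, offspring generation, and elitist survival selection described above:

```python
import random

def next_generation(population, cost, crossover, mutate,
                    mutation_prob=0.15, rng=random):
    """One generation: populationSize/2 randomly chosen parent pairs each
    produce two offspring, and only the best half of the combined
    parents + offspring survives (elitist survival selection)."""
    offspring = []
    for _ in range(len(population) // 2):
        p1, p2 = rng.sample(population, 2)        # unrestricted, egalitarian
        for child in (crossover(p1, p2), crossover(p2, p1)):
            if rng.random() < mutation_prob:      # mutation with small probability
                child = mutate(child)
            offspring.append(child)
    combined = population + offspring
    combined.sort(key=cost)                       # lower cost = higher quality
    return combined[:len(population)]

def ox(p1, p2):
    """Toy crossover: keep a prefix of p1, fill the rest in p2's order."""
    head = p1[:len(p1) // 2]
    return head + [g for g in p2 if g not in head]

def swap_mut(p):
    q = p[:]
    i, j = random.sample(range(len(q)), 2)
    q[i], q[j] = q[j], q[i]
    return q

# Toy cost: total displacement of each element from its sorted position.
cost = lambda p: sum(abs(v - i) for i, v in enumerate(p))

random.seed(42)
pop = [random.sample(range(6), 6) for _ in range(8)]
for _ in range(20):
    pop = next_generation(pop, cost, ox, swap_mut)
print(cost(pop[0]))  # best cost never increases across generations (elitism)
```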

In the third layer, a fine-tuning procedure, which works based on a point-based strategy, is run on the solutions in the constructed pool, using the solutions left in the eliteHeap. As line 27 of the pseudo-code presented in Fig. 3 shows, this point-based strategy is a problem-specific restricted tabu search, called restrictedTabu(). It is restricted in the sense that after a maximum number of iterations with no improvement, the procedure is halted and the best solution encountered is returned. As mentioned, this restricted tabu strategy is problem-specific and is described for each problem in the next subsections.

Towards its fine-tuning process, the restricted tabu procedure performs a number of actions aimed at keeping a balance between intensification and diversification. As indicated in the switch-case starting at line 29 of the pseudo-code in Fig. 3, this procedure randomly selects one of four different actions. The selection is made based on probabilities given as input parameters; the four actions, one of which is stochastically selected and performed in the switch-case, are as follows.

The first action is selected with the chance of intensificationProb and causes a problem-specific large neighborhood search to be performed on the current solution. Whenever this large neighborhood search is unable to improve the current solution, a perturbation is performed on the current solution, replacing it with one of its neighbors.

The second action is selected with the chance of concentrationProb, and replaces the current solution with the global best solution among all threads. Then, in order to prevent the algorithm from revisiting the same solution, this global best solution undergoes a perturbation. The third action is chosen with the chance of perturbationProb and only causes a perturbation to be applied to the current solution. The main purpose of applying a perturbation is to diversify the search and assist the algorithm to escape local optima.

The fourth action is selected with the chance of refreshProb. Under this action, the current solution is replaced with the unused highest-quality solution from the previous layers of the same thread. The selection of a replacement solution is accomplished in the following order. Initially, the best unused solution of the final population constructed in the second layer is considered and removed from the pool. If no element is left there, the next highest-quality solution remaining in the eliteHeap of the first layer is considered and removed from the eliteHeap. In the case where both the pool and the eliteHeap have become empty, a new solution is constructed with the problem-specific construction method specified in the first layer.

It is worth noting that the sum of the four parameters intensificationProb, concentrationProb, perturbationProb, and refreshProb is 1, implying that, depending on the setting of these parameters, different balances between intensification and diversification can be expected. Moreover, because of their constant sum, introducing only three parameters is enough, and the remaining parameter can be calculated from the given ones.
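The probabilistic selection among the four actions amounts to sampling from a categorical distribution in which the fourth probability is implied by the other three. A minimal Python sketch of this selection (the action names are illustrative labels, not identifiers from the paper's pseudo-code):

```python
import random

def pick_action(intensification_prob, concentration_prob, perturbation_prob,
                rng=random):
    """Select one of the four actions; refreshProb need not be passed,
    since the four probabilities must sum to 1."""
    r = rng.random()
    if r < intensification_prob:
        return "intensify"        # large neighborhood search (+ perturb on failure)
    r -= intensification_prob
    if r < concentration_prob:
        return "concentrate"      # jump to a perturbed copy of the global best
    r -= concentration_prob
    if r < perturbation_prob:
        return "perturb"          # diversify from the current solution
    return "refresh"              # replace with an unused elite solution

random.seed(1)
counts = {a: 0 for a in ("intensify", "concentrate", "perturb", "refresh")}
for _ in range(10000):
    counts[pick_action(0.4, 0.2, 0.3)] += 1
print(counts)  # frequencies roughly proportional to 0.4 / 0.2 / 0.3 / 0.1
```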

As shown in the pseudo-code of the P3H, the underlined modules are problem-specific and should be implemented separately for any specific problem. These problem-specific modules comprise the major components of the P3H and are as follows: (i) an initial solution construction method, (ii) a crossover operation, (iii) a mutation operation, (iv) a restrictedTabu component, (v) a large neighborhood scheme, and (vi) a perturb mechanism. The selection of these problem-specific modules is outlined in Table 1 and is described in detail for the PFSP, JSP, and QAP, respectively, in the three following subsections.

4.1 Problem-specific modules for the PFSP

The aforementioned six components have been implemented for the PFSP as follows. For the first component, the solution construction module, four different initial solution construction methods have been considered. These methods include (i) a uniformly-at-random construction method, in which all n! job permutations have an equal chance of 1/n! of being selected, (ii & iii) two re-blocking construction methods proposed in [9], and (iv) the RJP (Recursive Johnson Procedure) proposed in [6]. All of these methods have been tested in Section 5 and, based on the results of the experiments, the re-blocking mechanism has been adopted. We denote these four methods by (i) UNIFORM, (ii) REBLOCKI, (iii) REBLOCKII, and (iv) RJP, respectively.

The next two components are the crossover and mutation operators. The Longest Common Sub-sequence (LCS) crossover developed in [37] has been used as the crossover operator, and a single swap move and an insertion move, each with a chance of 0.5, have been used as the mutation operator. It is worth noting that the mutation operation is applied with a small probability (0.1–0.2), given as an input parameter.

The choice of the restrictedTabu module, as the next component, is based on a modified version of the tabu search proposed in [47]. In this modified implementation, first an ordinary insertion local search is run on the current solution. By using critical path information, this insertion local search avoids unfruitful moves. Moreover, by using the forward and backward completion time matrices, it reduces the computation time [6, 9]. The restrictedTabu module stops when no improving move has been found for a fixed number of iterations.

The last two components are the Large-Neighborhood-Improvement and the Perturb modules. The iterative decomposition procedure of [6] is used as the large neighborhood improvement. This method divides the solution into substrings, called chunks, and iteratively optimizes each chunk. The perturb module employed is the combination of a 3-replacement move and a random insertion move. The 3-replacement move selects three random indexes, i1, i2, i3, and puts the job at position i1 in position i2, the job at position i2 in position i3, and finally the job at position i3 in position i1. A parameter called perturbIntensity controls the degree of such perturbation through the number of times this 3-replacement move is applied. For instance, when this parameter is set to two, the 3-replacement move is applied twice.
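The 3-replacement move is a 3-cycle on the selected positions and can be sketched as follows in Python (an illustrative reconstruction from the description above, not the paper's implementation):

```python
import random

def three_replacement(perm, perturb_intensity=1, rng=random):
    """One 3-replacement move per repetition: the job at position i1
    moves to i2, the job at i2 moves to i3, and the job at i3 moves
    to i1. perturbIntensity sets how many times the move is applied."""
    p = perm[:]
    for _ in range(perturb_intensity):
        i1, i2, i3 = rng.sample(range(len(p)), 3)
        # The right-hand side is evaluated first, so the old values are used.
        p[i2], p[i3], p[i1] = p[i1], p[i2], p[i3]
    return p

random.seed(7)
print(three_replacement([0, 1, 2, 3, 4], perturb_intensity=2))
```

Since the three positions are distinct, a single application always changes the permutation, which is what makes the move useful for escaping local optima.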

4.2 Problem-specific modules for the JSP

The six components used for the JSP are as follows. For the first component, solution construction, three initial methods have been tested. The first method is the forward-backward Semi-Active Schedule Generator (SASG) proposed in [7], and the second method is the forward-backward Randomized Giffler and Thompson (RGT) construction method proposed in [8]. The RGT primarily generates non-delay schedules using the well-known Giffler and Thompson procedure and probabilistically improves a non-delay schedule based on a forward-backward mechanism.

The other employed construction method, called the Iterative Carlier (IC) procedure, uses the Carlier one-machine optimization procedure [19] and applies it iteratively. The process starts with an empty schedule and then, based on a random order of machines, schedules the machines incrementally. In each iteration, through solving the associated one-machine problem, a single machine is scheduled to optimality, and the result updates the current solution. As will be discussed in Section 5, based on the computational experiments, only the RGT has been adopted.

For the second and third components, the crossover and mutation operators, the Linear Order crossover (LOX) of [30] and a simple mutation operator that performs a single swap move on a random machine have been considered.

Since the encoding used for the JSP is based on simple permutations of operations on machines, the LOX has to be applied to each pair of permutations on the same machine independently. The main problem with such an independent application of the LOX, however, is that the resulting offspring solution may be infeasible. To rectify this possible infeasibility, we have used the Giffler and Thompson procedure to not only remove any infeasibility but also create an active schedule.

The fourth component, the restrictedTabu module, has been implemented based on a modified version of the Limited Tabu Search (LTS) proposed in [8]. Like the LTS, the employed module works based on the N5 and N6’ neighborhoods [48]; unlike the LTS, however, the restrictedTabu module keeps no history of solutions visited during the search process. Furthermore, similar to the LTS, for evaluating potential neighbor solutions, it uses an estimate of the makespan and avoids any exact evaluation.

The fifth component, the Large-Neighborhood-Improvement module, is the Forward-Backward Shifting Bottleneck Procedure (ForwardBackwardSBP), originally proposed in [7]. The ForwardBackwardSBP works as a solution-refinement routine which incorporates the post-optimization stage of the Shifting Bottleneck Procedure (SBP) of [1].

The Perturb module, as the sixth component, operates based on the N1 neighbourhood [72] and simply performs a random swap of two neighbouring operations on a random critical block. For this module, the perturbIntensity parameter determines the number of times a move is performed. The updating of the makespan and critical path, which needs to be done after the completion of each move, is performed by the method presented in [8].
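As a sketch (with hypothetical names, and with the block boundaries passed in rather than derived from the critical path), the Perturb step could look like the following; the makespan and critical-path update of [8] is omitted:

```cpp
#include <vector>
#include <random>
#include <algorithm>
#include <cassert>

// Illustrative sketch of the Perturb module: perform perturbIntensity
// random swaps of two neighbouring operations inside the block
// seq[first..last]. In the P3H the block would be a random critical
// block, and the makespan and critical path would be updated after
// each move.
void perturb(std::vector<int>& seq, int first, int last,
             int perturbIntensity, std::mt19937& rng) {
    if (last - first < 1) return;                 // need at least two operations
    std::uniform_int_distribution<int> pick(first, last - 1);
    for (int k = 0; k < perturbIntensity; ++k) {
        int i = pick(rng);                        // swap positions i and i+1
        std::swap(seq[i], seq[i + 1]);
    }
}
```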

4.3 Problem-specific modules for the QAP

For the QAP, the six major components of the P3H have been set as follows. For the first component, the construction module, three construction methods have been implemented. The first method, named UNIFORM, simply assigns facilities to locations uniformly at random. The second construction module, the Randomized Heuristic (RH) developed in [77], aims at assigning facilities with high interactions (flows) to locations with short distances to the others. A percentage of facilities with high interactions are randomly selected and assigned to locations with small distances from one another. Then the remaining facilities are allocated, one after another, to empty locations in such a way that the total assignment cost experiences the least increase. In effect, the second part of this module works similarly to the procedure presented in [43].

The next construction method tested was originally presented in [11] and is a modified implementation of the procedure presented in [43]. In the employed procedure, named GREEDY, a solution is incrementally constructed by deciding on the best value of each solution component, one at a time, in a greedy manner. First, a random permutation of facilities is generated; based on this order, the best location for each facility is determined and added to the partial solution. In Section 5, all three construction methods are tested and, because of its higher performance, the GREEDY method has been adopted.
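A minimal sketch of this greedy construction idea is shown below (our own illustration, not the code of [11]; the facility order is passed in as a parameter, whereas the actual GREEDY method would generate it at random):

```cpp
#include <vector>
#include <limits>
#include <cassert>

// Sketch of a GREEDY-style QAP construction: facilities are scanned
// in the given order and each is placed on the empty location that
// causes the smallest increase in the partial assignment cost.
// flow[f][g] is the flow between facilities, dist[a][b] the distance
// between locations; the result maps facility -> location.
std::vector<int> greedyConstruct(const std::vector<std::vector<int>>& flow,
                                 const std::vector<std::vector<int>>& dist,
                                 const std::vector<int>& order) {
    int n = static_cast<int>(flow.size());
    std::vector<int> loc(n, -1);
    std::vector<bool> taken(n, false);
    for (int f : order) {
        long long best = std::numeric_limits<long long>::max();
        int bestL = -1;
        for (int l = 0; l < n; ++l) {
            if (taken[l]) continue;
            long long inc = 0;                    // cost added by placing f on l
            for (int g = 0; g < n; ++g)
                if (loc[g] >= 0)
                    inc += static_cast<long long>(flow[f][g]) * dist[l][loc[g]]
                         + static_cast<long long>(flow[g][f]) * dist[loc[g]][l];
            if (inc < best) { best = inc; bestL = l; }
        }
        loc[f] = bestL;
        taken[bestL] = true;
    }
    return loc;
}
```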

As the second component, the crossover operator, the cohesive merge procedure of [28] has been implemented. The cohesive merge is a computationally intensive crossover scheme which works based on the median distance of locations (sites). In our implementation, as an initialization phase, the median distance for all locations is first calculated. Next, given two parent solutions, every location is tested as a potential pivot site, resulting in 2n different solutions. From these 2n generated solutions, only the top two are selected as the offspring.

As the third component, the mutation operator, a combination of two different methods has been considered and applied at random. The first operator is a simple swap of the locations of two facilities, and the second applies a random cycle of size 3. For instance, a cycle (f1,f2,f3) indicates that facility f1 should be re-assigned to the location of facility f2, facility f2 to the location of facility f3, and facility f3 to the location previously occupied by facility f1.
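The size-3 cycle can be sketched in a few lines (illustrative names; p[f] gives the location currently assigned to facility f):

```cpp
#include <vector>
#include <cassert>

// Size-3 cycle mutation on a facility-to-location assignment:
// f1 takes f2's location, f2 takes f3's location, and f3 takes
// the location vacated by f1.
void cycle3(std::vector<int>& p, int f1, int f2, int f3) {
    int vacated = p[f1];
    p[f1] = p[f2];
    p[f2] = p[f3];
    p[f3] = vacated;
}
```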

As the fourth component, the restrictedTabu module, the Extended N* local search procedure [77] has been adopted. This procedure starts with the simple swap neighborhood and continues using it while improvement is possible. It then switches to the Nλ neighborhood [43], in which, after each degrading move is applied, the two facilities involved in that swap are marked as tabu to prevent them from participating in further iterations.

The fifth component, Large-Neighborhood-Improvement, has been adopted from the Neighborhood Decomposition-based Search (NDS) developed in [11]. In this method, the relocation matrix associated with each QAP solution is passed to the Quick Match (QM) linear assignment solving routine [42], and the resulting linear assignment proposals are extensively evaluated for improvement.

The sixth component, the Perturb module, simply applies a number of random cycles of size 3, with the number of cycles determined by the perturbIntensity parameter; for instance, when perturbIntensity is set to 5, five such cycles are applied.

5 Computational experiments

The P3H has been coded in C++ and the computational experiments have been performed on a Quad-Core PC with a 3.4 GHz CPU. Visual C++ has been used as the compiler, and the parallelization is based on the OpenMP programming interface [21], with up to four cores of the CPU being used. Since the P3H is applied to three entirely different problems, a wide variety of benchmark instances have been involved in the experiments. For each of the three problems, the corresponding benchmark instances have been selected as follows.

For the PFSP, Taillard's repository of benchmark instances [63] has been employed. The repository consists of 120 instances in 12 classes, with 10 instances in each class. The numbers of jobs and machines of these instances are in the ranges 20–500 and 5–20, respectively.

For the JSP, 43 instances have been taken from the ORLIB website. This website is managed by Brunel University, UK, and contains a large number of benchmark instances produced for operational research problems. These instances are (i) a selection of 11 hard-to-solve instances from [41], named laxx, (ii) 3 classic instances from [31] named ft06, ft10, and ft20, (iii) 4 instances yn1-4 from [76], (iv) 10 instances, swv01-10 from [60], (v) 5 instances abz5-9 due to [1], and (vi) 10 instances orb01-10 from [12].

For the QAP, a representative set of 29 benchmark instances, having up to 90 facilities/locations, has been selected from the QAPLIB [18]. These instances include 16 instances named taixxa and taixxb, 7 instances named skoxx, and a selection of 6 representative instances, namely els19, bur26d, nug30, ste36c, lipa50a, and tai64c. The numeric literals (xx) in the instance names indicate the instance size. It is worth mentioning that these instances are among the most popular instances in the literature.

Since the initial solution construction method plays a key role in the overall quality of the solutions produced by the procedure, we first analyze the effect of selecting different initial solution construction methods for each of the three problems. Then, with respect to each problem, the parameter values set for the P3H are discussed. An analysis of the parallelization effect is provided next. Finally, a brief evaluation comparing the performance of the P3H with that of other state-of-the-art procedures is provided.

5.1 Comparing the initial solution construction methods

As mentioned in Section 4.1, four heuristic construction methods have been implemented for the PFSP, namely (i) UNIFORM, (ii) REBLOCKI, (iii) REBLOCKII, and (iv) RJP. To evaluate these methods, a set of 12 Taillard instances of different sizes has been used, and each construction method has been run 1000 times for each instance, with different random seed values. Table 2 compares these four methods. The column %DEVavg shows the average percent deviation of the solution makespan (M) from the upper bound (UB), calculated as (M − UB)/UB × 100. The column %STDEV shows the sample standard deviation as a percentage of the best known upper bound. In other words, STDEV = √( (1/(N − 1)) ∑_{i=1}^{N} (M_i − M_avg)² ) and %STDEV = STDEV/UB × 100, with N set to 1000 and M_i denoting the makespan value for run i. Also, M_avg denotes the average makespan over the 1000 runs. For each method, the total construction time for generating all 12000 solutions is also shown in Table 2.
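To make the two measures concrete, the following helper (an illustration, not the authors' code) computes %DEVavg and %STDEV from the makespans of a batch of runs:

```cpp
#include <vector>
#include <cmath>
#include <cassert>

// %DEVavg = (M_avg - UB)/UB * 100, and %STDEV = STDEV/UB * 100, with
// STDEV the sample standard deviation of the N makespans.
struct DevStats { double devAvg; double stdevPct; };

DevStats devStats(const std::vector<double>& M, double UB) {
    int N = static_cast<int>(M.size());
    double sum = 0.0;
    for (double m : M) sum += m;
    double Mavg = sum / N;
    double ss = 0.0;
    for (double m : M) ss += (m - Mavg) * (m - Mavg);
    double stdev = std::sqrt(ss / (N - 1));   // sample standard deviation
    return { (Mavg - UB) / UB * 100.0, stdev / UB * 100.0 };
}
```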

Table 2 Comparing the performance of the four different construction methods for the PFSP

Under the assumption that %DEVavg is normally distributed, Fig. 4 shows the estimated probability density functions for all construction methods. The probability that a construction method generates a solution with zero percentage deviation from the best known solution (upper bound), i.e. P(X ≤ 0) = F_X(0), can be calculated for the four methods as 1.5 × 10⁻¹¹, 2.8 × 10⁻⁹, 8.3 × 10⁻¹², and 4.8 × 10⁻²⁸, respectively. This indicates that the second method, the first re-blocking strategy, has the highest chance of generating solutions with zero percentage deviation from the best known solution.
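Under this normality assumption, P(X ≤ 0) can be evaluated directly from the fitted mean (%DEVavg) and standard deviation (%STDEV) via the standard normal CDF; a small helper might read:

```cpp
#include <cmath>
#include <cassert>

// P(X <= 0) for X ~ Normal(mu, sigma): the estimated probability
// that a single construction run reaches zero deviation from the
// best known solution.
double probZeroDeviation(double mu, double sigma) {
    double z = (0.0 - mu) / sigma;                   // standardise
    return 0.5 * std::erfc(-z / std::sqrt(2.0));     // standard normal CDF at z
}
```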

Fig. 4
figure 4

The estimated normal distribution functions for the four different construction methods for the PFSP

In effect, the first method, the closest competitor of the winner, would have to be at least 100 times faster than the winner to outperform it, whereas a comparison of construction times shows that it is only 3 times faster. Hence, based on these observations, the first re-blocking method has been adopted as the construction module for the PFSP.

For selecting the best construction method for the JSP, all three presented methods have been considered, namely SASG, RGT, and IC. Each of these methods has been run with 1000 different random seeds for each of the 43 JSP benchmark instances mentioned. As can be seen in Table 3, IC and RGT show superior performance over SASG. However, it should be noted that IC is considerably slower than both SASG and RGT. Figure 5 shows the estimated normal probability density functions for the three methods. The probabilities P(X ≤ 0) for SASG, RGT, and IC can be calculated as 8.4 × 10⁻¹⁰, 2.3 × 10⁻⁶, and 1.7 × 10⁻⁶, respectively. Despite the fact that IC produces the highest quality solutions, its excessive computational time prevents it from being selected; in effect, IC is more than 100 times slower than RGT and more than 60 times slower than SASG. Based on these considerations, RGT has been adopted as the construction module for the JSP.

Table 3 Comparing the performance of the three different construction methods for JSP
Fig. 5
figure 5

The estimated normal distribution functions for the three different construction methods for the JSP

For the QAP, all three construction methods have also been compared and the best one adopted. The results are shown in Table 4 and Fig. 6. In this table, %DEVavg shows the average deviation of the solution cost (C) from the best known solution (BKS), computed as (C − BKS)/BKS × 100. Similarly, %STDEV has been calculated as √( (1/(N − 1)) ∑_{i=1}^{N} (C_i − C_avg)² ) × 100/BKS. The columns under the titles UNIFORM, RH, and GREEDY correspond to the uniform, RH, and greedy methods, respectively, described in Section 4.3. As can be seen, the GREEDY method, despite being relatively slower, shows superior performance. Likewise, assuming a normal distribution, the values of P(X ≤ 0) for UNIFORM, RH, and GREEDY are 1.6 × 10⁻⁹, 4.9 × 10⁻¹¹, and 6.1 × 10⁻⁶, respectively. Therefore, based on these considerations, GREEDY has been adopted as the construction module for the QAP.

Table 4 Comparing the performance of the three different construction methods for the QAP
Fig. 6
figure 6

The estimated normal distribution functions for the three different construction methods for the QAP

It should be noted that all the comparisons of initial solution construction methods for the PFSP, JSP, and QAP reported in this subsection have been performed in the same running environment.

5.2 Parameter settings

Regardless of the problem the P3H is applied to, it has a total of 11 parameters to be adjusted. In setting these parameters, it has been noted that exploiting some high quality neighborhoods without effectively exploring the solution space, and exploring the solution space without effectively exploiting high quality neighborhoods, is the easiest trap that an inappropriate parameter setting for the P3H can fall into. This principle is the main basis for the manual setting of the P3H parameters. It is worth noting that, while an adaptive (automatic) parameter tuning method [40] could have been adopted, a manual parameter tuning, guided by balancing the exploration and exploitation forces of the P3H, has been used and is explained as follows.

In effect, in setting these parameters for each of the three problems, a small number of instances with varying degrees of complexity, with respect to their characteristics and sizes, have been selected, and different values of the parameters have been tested. The set of values which performed best has then been chosen. In these parameter setting experiments, we found that many different settings achieved the same high quality performance.

This similar performance can mainly be explained by the robustness of the procedure. In setting the parameters, we noticed that the appropriate setting of a particular parameter depends on the values of the other parameters. In effect, because of this interdependence, whereas in some settings of the other parameters lower values of a given parameter contribute to high quality solutions, in other settings the reverse is true. This is partly due to the facts that (i) the performance of the P3H mainly depends on how it strikes a balance between exploration and exploitation, and (ii) it is the combination of parameters that affects this balance.

In cases where increasing one parameter, like perturbationProb or mutation, contributes to further exploration, and increasing another, like intensificationProb or concentrationProb, contributes to further exploitation, the setting of the first group of parameters can offset that of the second group and vice versa. Table 5 shows the parameter settings for each of the three problems, and includes a UNIVERSAL column proposing initial starting values for applying the P3H to other optimisation problems. This column is guided by our preliminary experiments as well as by the average values for the three problems under study. It is suggested that these universal starting values be adjusted based on both the exploration/exploitation considerations explained in Section 4 and any relevant problem-specific knowledge.

Table 5 The setting of parameters for the PFSP, JSP, and QAP

5.3 Analyzing the effect of parallelization

Given our Quad-Core processor and the capability of the P3H to work with different numbers of threads, we analyzed four variants of the P3H, denoted P3Hc, in which c is the number of active threads. Although there is no limitation on the number of threads employed in the P3H, the Quad-Core processor has enforced a limit of four threads.

Each variant has been run 10 times with a problem-specific time limit for each run. The time limits have been set so that the P3H has a maximum allowed running time equal or close to that of other algorithms in the literature. In this direction, for the PFSP and JSP, the time limit has been set to n·m/10 and max(1.0, (n/m) × (9n − 60)) seconds per run, respectively, where n and m are the numbers of jobs and machines. In solving the QAP, the time limit has been set to (1/10) × 2^(N/10) minutes, where N is the instance size. For each variant, the best and average percent deviations from the best known solution (tightest upper bound), %DEVbest and %DEVavg, are reported. Additionally, to provide a basic estimate of the time complexity of the P3Hc, its maximum allowed run times for various sample problem sizes are reported in Table 6.

Table 6 Maximum allowed running times of the P3H, for the PFSP, JSP, and QAP, respectively

Table 7 shows the results of running all four variants on the 12 classes of Taillard instances. As can be seen, there is an improving trend as the number of cores increases. For instance, we observe a 23% improvement in %DEVavg, from 0.64 with the P3H1 to 0.49 with the P3H4.

Table 7 Analyzing the effect of the P3H parallelization on %DEVbest and %DEVavg for the PFSP

As shown in Table 8, similar comparisons have been made for the JSP. For some JSP instances, increasing the number of cores from 2 to 3 has a non-improving or degrading effect on %DEVavg and %DEVbest. This may be due to the unpredictable randomized nature of metaheuristics in general, and of parallel metaheuristics in particular, which can lead to unexpected behaviors. In effect, the issue is more critical for parallel methods because of the parallel computational overhead and possible race conditions on the shared variables. It should be noted that, despite the increase of %DEVavg for some individual instances, the total %DEVavg of the P3H2 equals that of the P3H3, and %DEVbest decreases from 0.34 to 0.30, indicating a significant improvement on the rest of the instances.

Table 8 Analyzing the effect of the P3H parallelization on %DEVbest and %DEVavg for the JSP

For the QAP instances, Table 9 shows the related results. Again, an improving trend can be observed in both %DEVavg and %DEVbest. In effect, %DEVavg improves by nearly 25% as the number of cores increases from 1 to 4.

Table 9 Analyzing the effect of the P3H parallelization on %DEVbest and %DEVavg for the QAP

Finally, in order to compare different variants of the P3H in the same running environment, three variants have been implemented and run on the same CPU, each using a single thread/core with an identical maximum running time. These variants are P3HGA, P3HENHANCE, and P3H1. In P3HGA, the enhancing module of the P3H is disabled and only the genetic algorithm layer is active; in P3HENHANCE, only the enhancing layer is active and the GA layer is deactivated. In P3H1, both the GA and enhancing layers are active, and the procedure operates on a single thread/core. The results are shown in Table 10. As can be seen, P3HENHANCE and P3H1 show superior performance compared to P3HGA. This highlights the crucial contribution of the enhancing module to the overall success of the P3H.

Table 10 Comparing the performance of the three different variants of the P3H

5.4 Comparison with other metaheuristics

In this section, we compare the P3H4 with some of the highest-performing metaheuristics in the literature. The comparisons are based on %DEVavg and the running time. It is worth noting that, since different programming approaches, computational environments, and numbers of threads have been employed, the reported running times serve informative purposes only, and any running time comparison needs to account for the discrepancies involved, from the number of threads through the type of processors to the programming environments.

Table 11 shows the result of comparing the P3H4 with several state-of-the-art metaheuristics. These methods include the Simulated Annealing algorithm, SAOP, of [49], the iterated local search (ILS) of [61], two Ant Colony Optimization algorithms, PACO and M-MMAS, proposed in [51], the Hybrid genetic algorithm, HGA_RMA, of [57], the Particle swarm optimization algorithm, PSOVNS of [68], the hybrid variable neighborhood search, NEGAVNS of [80], the Re-blocking Adjustable Memetic Procedure, RAMP, of [9], and the Refining Decomposition-based Integrated Search, RDIS, of [6].

Table 11 Comparison of the %DEVavg of P3H4 to that of other metaheuristics for the PFSP

As can be seen in Table 11, the P3H4 outperforms all other approaches on 20 × 20 and 50 × 20 problem groups. In addition, for 20 × 5, 20 × 10 and 100 × 20 instances, the P3H4 performance is equal to that of top performing methods. Furthermore, except for three problem groups with the deviation percentages of 1.3%, 1.69%, and 1.22%, the average deviation from the best known solution is always less than 0.6%.

Concerning the processor speeds, it should be mentioned that HGA_RMA, NEGAVNS, and PSOVNS have been run on PCs with 2.6 GHz, 2.4 GHz, and 2.8 GHz clock speeds respectively. Also both RDIS and RAMP have been run on a 2.2 GHz CPU. Table 12 shows the maximum allowed running time, in seconds, for different approaches. As can be seen, NEGAVNS, RDIS, RAMP, and the P3H4 have been allowed an equal time limit and HGA_RMA has the lowest running time. It should be noted that SAOP, ILS, M-MMAS, and PACO have all been re-implemented in [57], and hence they are run on the same processor, and share the same maximum allowed running time, as that of HGA_RMA.

Table 12 Comparison of the maximum allowed running times of the P3H4 to that of other metaheuristics for the PFSP

Further, since NEGAVNS, RAMP, RDIS, and the P3H4 have identical maximum allowed running times, a Friedman test has been performed on their relative ranks, presented in Table 13. The interested reader is referred to [26] for different tests applied to metaheuristic performance. It should be noted that the ranks are calculated based on the average deviation from the best known solution, %DEVavg; in the case of tied %DEVavg values, the average of the tied ranks is used. As can be seen, whereas RAMP and the P3H4 have the lower ranks on smaller instances, NEGAVNS shows better performance on the larger instance groups. The Friedman test, performed on the ranked data set, shows a statistically significant difference in the ranks, with χ2(2) = 15.286 and p = 0.002.
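For reference, the Friedman statistic used here can be computed from the rank table directly; the following is a sketch without the tie correction, where ranks[b][j] is assumed to hold the rank of method j on instance group b:

```cpp
#include <vector>
#include <cmath>
#include <cassert>

// Friedman chi-square statistic: with N blocks (instance groups) and
// k methods, chi2 = 12/(N*k*(k+1)) * sum_j Rj^2 - 3*N*(k+1), where
// Rj is the rank sum of method j over all blocks.
double friedmanChi2(const std::vector<std::vector<double>>& ranks) {
    int N = static_cast<int>(ranks.size());
    int k = static_cast<int>(ranks[0].size());
    double sumSq = 0.0;
    for (int j = 0; j < k; ++j) {
        double Rj = 0.0;
        for (int b = 0; b < N; ++b) Rj += ranks[b][j];
        sumSq += Rj * Rj;
    }
    return 12.0 * sumSq / (N * k * (k + 1)) - 3.0 * N * (k + 1);
}
```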

Table 13 The ranks of NEGAVNS, RAMP, RDIS, and P3H4 based on average deviation from the best known solution

The performance of the procedure on the JSP has also been compared with that of several metaheuristics, as shown in Table 14. These methods are (i) the Small-Large Embedded Neighborhood Search (SLENP) of [7], (ii) the Tabu-based Genetic Algorithm (TGA) of [8], (iii) the Tabu Search and Simulated Annealing (TSSA) hybrid of [78], and (iv) the Iterative Ejections of Bottleneck Operations (IEBO) procedure of [45]. The column Tavg shows the average running time, in seconds, to find the best solution for each method. Both SLENP and TGA have been run on a PC with a 2.2 GHz CPU, TSSA on a 3.0 GHz CPU, and IEBO on a cluster with 2.93 GHz CPU cores.

Table 14 Comparison of the %DEVavg and Tavg of the P3H4 to that of other metaheuristics for the JSP

As shown in Table 14, in terms of deviation from the best known solution, the P3H4 outperforms both SLENP and TGA, with an overall %DEVavg of 0.56%, compared to the 1.11% and 1.28% deviations of SLENP and TGA, respectively. Moreover, on the orb instances, the P3H4 outperforms SLENP, TGA, and TSSA, with a %DEVavg of 0.00% over all instances. IEBO has the lowest %DEVavg on the swv and yn instances, at the expense of significantly higher running times compared to the three other methods.

With respect to the QAP, the performance of the P3H4 has been compared with that of the Progressive Adjusting Structural Solver (PASS) of [10], the Neighborhood Decomposition-based Search (NDS) of [11], the Self-Adaptive Facility Interchange (SAFI) of [77], the Diversified Tabu Search (DivTS) of [39], and the Cooperative Parallel Tabu Search (CPTS) of [38]. Since each study, except for CPTS, has reported performance only on a subset of the instances under study, it is difficult to draw any conclusion based on overall averages. However, it can be seen that the P3H4 outperforms PASS, NDS, SAFI, and ACO-GA/LS on the sko instances. Overall, the CPTS seems to be the highest-performing algorithm for the QAP to date. The CPTS uses a parallel approach and has been run on 10 CPUs, each with a 1.3 GHz clock speed. In Table 15, the average running time (Tavg) of every algorithm except the P3H4 is reported in minutes. However, since the P3H4 works based on a time-limit stopping criterion, its maximum allowed time, Tmax, is reported instead. As can be seen, while having an overall performance close to that of the CPTS, the P3H4 is significantly faster.

Table 15 Comparing the performance of the P3H4 with that of other metaheuristics with respect to %DEVavg and running time for the QAP

6 Concluding remarks

With respect to solving three hard-to-solve combinatorial optimization problems, the P3H has proved both robust and effective, achieving overall average best deviations (%DEVbest) of 0.35%, 0.25%, and 0.07% for the PFSP, JSP, and QAP, respectively. This is due to two complementary facts. First, the employed threads use the components in random order and thereby explore the solution space effectively. Second, the threads cooperate by improving a common solution and thereby exploit high quality areas of the solution space. It is this balance between exploration and exploitation which has significantly contributed to the robustness and efficiency of the P3H. It is worth noting that the success of the P3H in solving the flowshop scheduling problem could have multiple practical implications and benefits with respect to total energy consumption in production systems [13].

Directions for overcoming limitations and further enhancing the procedure are as follows. First, the parallel strategy employed can be enhanced by creating a pool of high quality solutions and giving all threads access to its elements; in this way, the threads fill and cooperatively improve the same pool. Second, a feedback mechanism can be incorporated into the procedure to take responsibility for further balancing exploration with exploitation. This balancing, which can be of paramount importance, can be achieved by embedding a learning capability in the procedure to adjust the parameters while a problem is being solved.