1 Introduction

Differential evolution (DE) is a popular meta-heuristic optimization algorithm belonging to the wide family of evolutionary algorithms (EAs). As with many other evolutionary algorithms, it aims to solve optimization problems through a simulated evolution of a population of candidate solutions. The population of candidates evolved by the algorithm performs a massively parallel search through the problem domain towards globally optimal solutions. The candidate solutions can be seen as individual points on the fitness landscape of the solved problem that are iteratively moved, in many directions at once, towards promising regions of the fitness landscape. The implicit parallelism of the algorithm makes it even more interesting for an implementation on a truly parallel platform.

In this chapter, we present a fine-grained implementation of DE designed specifically for super parallel SIMD (single-instruction multiple-data) devices such as GPUs. The SIMD hardware found in modern-day GPUs supports parallel execution of hundreds of threads at the same time, and the software drivers and runtime libraries allow efficient scheduling of tens of thousands of threads. General-purpose computing on graphics processing units (GPGPU) can use either commodity hardware, such as the GPUs used primarily for computer graphics and entertainment, or GPU coprocessors designed primarily to perform massively parallel computations.

This chapter is organized in the following way: in Sect. 2, the basic principles of differential evolution and some of its applications are presented. Section 3 gives a brief overview of GPU computing, the CUDA platform, and recent implementations of differential evolution on GPUs. Many-threaded differential evolution is presented in Sect. 4, and its performance on selected test problems, first reported in [16] and [17], is described in Sect. 5.

2 Differential Evolution

Differential evolution (DE) is a versatile and easy-to-use stochastic evolutionary optimization algorithm [29]. It is a population-based optimizer that evolves a population of real-encoded vectors representing solutions to a given problem. DE was introduced by Storn and Price in 1995 [38, 39], and it quickly became a popular alternative to the more traditional types of evolutionary algorithms. It evolves a population of candidate solutions by iteratively modifying them through differential mutation and crossover [29]. In each iteration, so-called trial vectors are created from the current population by differential mutation and further modified by various types of crossover operators. Finally, the trial vectors compete with existing candidate solutions for survival in the population.

2.1 The DE Algorithm

DE starts with an initial population of real-valued vectors. The vectors are initialized with real values either randomly or so that they are evenly spread over the problem space. The latter initialization leads to better optimization results [29].

During the optimization, DE generates new vectors that are scaled perturbations of existing population vectors. The algorithm perturbs selected base vectors with the scaled difference of two (or more) other population vectors in order to produce the trial vectors. The trial vectors compete with members of the current population with the same index called the target vectors. If a trial vector represents a better solution than the corresponding target vector, it takes its place in the population [29].

DE has two most significant parameters [29]: the scaling factor F ∈ [0, ∞) controls the rate at which the population evolves, and the crossover probability C ∈ [0, 1] determines the proportion of parameters that are transferred to the trial vector from its opponent. The size of the population and the choice of operators are other important parameters of the optimization process.

The basic operations of classic DE can be summarized using the following formulas [29]: the random initialization of the ith vector with N parameters is defined by

$$\displaystyle{ x_{i}[j] = \mathit{rand}(b_{j}^{L},b_{ j}^{U}),\ \ j \in \{ 0,\ldots,N - 1\} }$$
(1)

where \(b_{j}^{L}\) is the lower bound of the jth parameter, \(b_{j}^{U}\) is the upper bound of the jth parameter, and rand(a, b) is a function generating a random number from the range [a, b]. A simple form of differential mutation is given by

$$\displaystyle{ v_{i}^{t} = v_{ r1} + F(v_{r2} - v_{r3}) }$$
(2)

where F is the scaling factor and \(v_{r1}\), \(v_{r2}\), and \(v_{r3}\) are three random vectors from the population. The vector \(v_{r1}\) is the base vector, \(v_{r2}\) and \(v_{r3}\) are the difference vectors, and the ith vector in the population is the target vector. It is required that i ≠ r1 ≠ r2 ≠ r3. The differential mutation in 2D (i.e., for N = 2) is illustrated in Fig. 1. The uniform crossover that combines the target vector with the trial vector is given by

$$\displaystyle{ l = \mathit{rand}(0,N - 1) }$$
(3)
$$\displaystyle\begin{array}{rcl} v_{i}^{t}[m]& = \left \{\begin{array}{@{}l@{\quad }l@{}} v_{i}^{t}[m]\quad &{\mbox{if}}\ (\mathit{rand}(0,1) < C)\ {\mbox{or}}\ m = l \\ x_{i}[m] \quad &{\mbox{otherwise}} \end{array} \right.&{}\end{array}$$
(4)

for each \(m \in \{ 0,\ldots,N - 1\}\). The uniform crossover replaces, with probability 1 − C, the parameters in \(v_{i}^{t}\) by the parameters from the target vector \(x_{i}\).

Fig. 1 Differential mutation

The outline of classic DE according to [10] is summarized in Algorithm 1. However, the monograph on DE by Price, Storn, and Lampinen [29] lists a different version of basic DE: a whole new population of trial vectors \(P^{t}\) is formed first and subsequently merged with P. This means that newly created trial vectors do not enter the population of candidate solutions P immediately and therefore cannot participate in the creation of further trial vectors until the whole population has been processed.
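For illustration, the following sketch (plain C, valid as CUDA-C host code) shows one generation of DE/rand/1/bin following the merge-after-the-whole-population variant of [29]; the function names and the fixed population size M and vector length N are ours and purely illustrative.

```cuda
#include <stdlib.h>
#include <string.h>

#define M 64   /* population size (illustrative value) */
#define N 32   /* vector length (illustrative value)   */

/* Uniform random double from [lo, hi). */
static double rnd(double lo, double hi) {
    return lo + (hi - lo) * ((double)rand() / ((double)RAND_MAX + 1.0));
}

/* One generation of DE/rand/1/bin. pop is an M x N population, F the scaling
 * factor, C the crossover probability, and fitness the objective (minimized). */
void de_generation(double pop[M][N], double F, double C,
                   double (*fitness)(const double *v)) {
    double trial[M][N];
    for (int i = 0; i < M; i++) {
        /* select three mutually different vectors, all different from the target i */
        int r1, r2, r3;
        do { r1 = rand() % M; } while (r1 == i);
        do { r2 = rand() % M; } while (r2 == i || r2 == r1);
        do { r3 = rand() % M; } while (r3 == i || r3 == r1 || r3 == r2);

        int l = rand() % N;   /* this parameter is always taken from the mutant */
        for (int m = 0; m < N; m++) {
            double v = pop[r1][m] + F * (pop[r2][m] - pop[r3][m]);    /* Eq. (2) */
            trial[i][m] = (rnd(0, 1) < C || m == l) ? v : pop[i][m];  /* Eq. (4) */
        }
    }
    /* merge P and P^t: a trial vector replaces its target if it is not worse */
    for (int i = 0; i < M; i++)
        if (fitness(trial[i]) <= fitness(pop[i]))
            memcpy(pop[i], trial[i], sizeof(pop[i]));
}
```

Repeatedly calling de_generation on a randomly initialized population (Eq. (1)) corresponds to the main loop of Algorithm 1.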

There are also many other modifications to the classic DE. Mostly, they differ in the implementation of particular DE steps such as the initialization strategy, the vector selection, the type of differential mutation, the recombination operator, and control parameter selection and usage [10, 29].

The initialization strategy affects the way vectors in the initial population are placed in the problem space. In general, a better initial coverage of the problem space represents a better starting point for the optimization process because the vectors can explore various regions of the fitness landscape from the very beginning [29].

The selection strategy defines how the target vector, the base vector, and the difference vectors are selected. Moreover, it has an effect on the time each vector survives in the population, which can be given either by the age of the vector or by its fitness. Popular base vector selection strategies are random selection and methods based on stochastic universal sampling [3, 29]. Random selection without restrictions allows the same vector to be used as a base vector more than once in each generation. The methods based on stochastic universal sampling ensure that all vectors are used as base vectors exactly once in each generation; they generate a permutation of vector indexes that defines which base vector will be coupled with which target vector [29]. An alternative strategy is the biased base vector selection, which uses the information about the fitness value of each vector when selecting base vectors. The biased strategies include the best-so-far base vector selection (the best vector in the population is always selected as the base vector) and target-to-best base vector selection (the base vector is an arithmetic recombination of the target vector and the best-so-far vector) [29]. The biased base vector selection schemes introduce a more intensive selection pressure which can, as in other evolutionary techniques, result in faster convergence, but it can also lead to a loss of diversity in the population.
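A permutation-based base vector assignment of this kind can be sketched as follows (a hedged illustration using the Fisher–Yates shuffle; the helper name is ours):

```cuda
#include <stdlib.h>

/* Fill perm[0..m-1] with a random permutation of 0..m-1 (Fisher-Yates shuffle).
 * perm[i] is then used as the index of the base vector paired with target
 * vector i, so that every vector serves as a base vector exactly once.        */
void base_vector_permutation(int *perm, int m) {
    for (int i = 0; i < m; i++) perm[i] = i;
    for (int i = m - 1; i > 0; i--) {
        int j = rand() % (i + 1);
        int tmp = perm[i]; perm[i] = perm[j]; perm[j] = tmp;
    }
}
```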

The differential mutation is the key driver of DE. It creates a trial vector as a recombination of the base vector and the scaled difference of selected difference vectors. The scaling factor F, which modifies vector differences, can be a fixed constant, a random variable drawn from some probability distribution, or defined by some other function as, e.g., in self-adaptive DE variants. A variable scaling factor increases the number of vector differentials that can be generated given the population P [29]. In general, a smaller scaling factor causes smaller steps in the traversal of the fitness landscape while a greater scaling factor causes larger steps. The former leads to longer convergence times, and the latter can cause the algorithm to miss the optima [29].

The recombination (crossover) operator plays a special role in DE. It achieves a similar goal as the mutation operator in other evolutionary algorithms, i.e., it controls the introduction of new material to the population using a mechanism similar to the n-point crossover in traditional EAs. The crossover probability C ∈ [0, 1] defines the probability that a parameter will be inherited from the trial vector. Similarly, 1 − C is the probability that the parameter will be taken from the target vector. The crossover probability also has a direct influence on the diversity of the population [10]. An increased diversity brought about by a larger C means a more intensive exploration and faster convergence of the algorithm at the cost of its robustness [29]. Common DE crossover operators include the exponential crossover and the uniform crossover. Some other crossover operators are, e.g., the arithmetic crossover and the either-or crossover [10, 29].

2.2 DE Variants and Applications

Particular DE variants are often referred to using a simple naming scheme [10, 29] based on the pattern DE/x/y/z. In the pattern, x describes the base vector selection type, y represents the number of differentials, and z is the type of crossover operator. For example, classic DE is often called DE/rand/1/bin, which means that it uses random base vector selection, a single vector differential, and uniform (binomial) crossover. Some other variants of DE described in the literature are listed below (a code sketch of their mutation steps follows the list) [10]:

  1.

    DE/best/1/*, which always selects the so-far best vector as the base vector; its differential mutation can be described by

    $$\displaystyle{ v_{i}^{t} = v_{\mathrm{ best}} + F(v_{r2} - v_{r3}) }$$
    (5)

    This DE variant can use any type of the crossover operator.

  2.

    DE/*/\(n_{v}\)/*, which uses an arbitrary base vector selection strategy, \(n_{v}\) differential vectors, and an arbitrary type of crossover operator. Its differential mutation can be described by

    $$\displaystyle{ v_{i}^{t} = v_{ r1} + F\sum _{k=1}^{n_{v} }(v_{r2}^{k} - v_{ r3}^{k}) }$$
    (6)

    where \(v_{r2}^{k} - v_{r3}^{k}\) is the kth vector differential.

  3.

    DE/rand-to-best/\(n_{v}\)/* combines random base vector selection with the best base vector selection strategy:

    $$\displaystyle{ v_{i}^{t} =\gamma v_{\mathrm{ best}} + (1-\gamma )v_{r1} + F\sum _{k=1}^{n_{v} }(v_{r2}^{k} - v_{ r3}^{k}) }$$
    (7)

    It uses the parameter γ ∈ [0, 1] to control the exploitation of the mutation. The larger the γ, the greater the exploitation. The parameter γ can also be adaptive.

  4.

    DE/current-to-best/1+\(n_{v}\)/* uses at least two difference vectors. The first differential participating in the mutation is computed between the best vector and the base vector, and all the other differentials are computed using randomly selected vectors:

    $$\displaystyle{ v_{i}^{t} = v_{ r1} + F(v_{\mathrm{best}} - v_{r1}) + F\sum _{k=1}^{n_{v} }(v_{r2}^{k} - v_{ r3}^{k}) }$$
    (8)
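The mutation step shared by these variants can be sketched as follows (a hedged illustration: Eq. (7) with γ = 0 reduces to Eq. (6), and with \(n_{v} = 1\) and γ = 1 it covers Eq. (5); all function and parameter names are ours):

```cuda
/* Mutation of DE/rand-to-best/nv (Eq. (7)). pop is a row-major population
 * matrix with rows of length n; r1 indexes the random base vector, best the
 * best-so-far vector, and r2[k], r3[k] the kth pair of difference vectors.  */
void mutate_rand_to_best(double *mutant, const double *pop, int n,
                         int r1, int best, const int *r2, const int *r3,
                         int nv, double F, double gamma) {
    const double *v_r1   = pop + r1 * n;
    const double *v_best = pop + best * n;
    for (int m = 0; m < n; m++) {
        double diff_sum = 0.0;
        for (int k = 0; k < nv; k++)
            diff_sum += pop[r2[k] * n + m] - pop[r3[k] * n + m];
        mutant[m] = gamma * v_best[m] + (1.0 - gamma) * v_r1[m] + F * diff_sum;
    }
}
```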

DE is a very successful algorithm with a number of applications. DE was used, among others, for merit analysis, for non-imaging optical design, for the optimization of industrial compressor supply systems, for multi-sensor fusion, for the determination of earthquake hypocenters, for 3D medical image registration, for the design of erasure codes, for digital filters, for the analysis of X-ray reflectivity data, to solve the inverse fractal problem, and for the compensation of RF-driven plasmas [29]. It was also used for evolutionary clustering [8] and to optimize the deployment of sensor nodes [34]. Despite its continuous nature, DE has also been used to solve combinatorial optimization problems. The applications of DE in this domain include turbo code interleaver optimization [19], scheduling of independent tasks in heterogeneous computing environments [15, 20], the search for optimal solutions to the linear ordering problem [37], and many others.

DE is a successful evolutionary algorithm designed for continuous parameter optimization driven by the idea of scaled vector differentials. That makes it an interesting alternative to the widespread genetic algorithms, which are designed to work primarily with discrete encodings of the candidate solutions. Like genetic algorithms, it represents a highly parallel population-based stochastic search meta-heuristic. In contrast to GA, differential evolution uses a real encoding of candidate solutions and different operations to evolve the population, which results in different search strategies and different directions taken by DE when traversing the fitness landscape of the problem domain.

3 GPU Computing

Modern graphics hardware has gained an important role in the area of parallel computing. GPUs have long been used to power gaming and 3D graphics applications, but more recently they have also been used to accelerate general computations, and the area of general-purpose computing on graphics processing units (GPGPU) has been flourishing ever since. The data parallel architecture of GPUs is suitable for vector and matrix algebra operations, which leads to the wide use of GPUs in the area of scientific computing, with applications in information retrieval, data mining, image processing, data compression, etc.

To simplify the development of GPGPU programs, various vendors have introduced languages, libraries, and tools to create parallel code rapidly. The GPU platform and API developed by nVidia is called CUDA (Compute Unified Device Architecture). It is based on the CUDA-C language, which is an extension to C that allows development of GPU routines called kernels. Each kernel defines instructions that are executed on the GPU by many threads at the same time following the SIMD model. The threads can be organized into so-called thread groups that can benefit from GPU features such as fast shared memory, atomic data manipulation, and synchronization. The CUDA runtime takes care of the scheduling and execution of the thread groups on available hardware. The set of thread groups requested to execute a kernel is called in CUDA terminology a grid. A kernel program can use several types of memory: fast local and shared memory, large but slow global memory, and fast read-only constant memory and texture memory. The structure of CUDA program execution and the relation of threads and thread groups to device memory is illustrated in Fig. 2.
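The following minimal CUDA-C sketch illustrates these concepts: a kernel executed by a grid of thread blocks, per-block shared memory, thread synchronization, and a host-side launch. The kernel and its parameters are illustrative only and are not part of the implementation described later in this chapter.

```cuda
#include <cuda_runtime.h>

/* Each thread block sums one row of a matrix stored in global memory;
 * partial results are combined in fast per-block shared memory.
 * Assumes row_len <= blockDim.x and blockDim.x is a power of two.      */
__global__ void row_sum(const float *matrix, float *sums, int row_len) {
    extern __shared__ float buf[];     /* dynamically sized shared memory */
    int row = blockIdx.x;              /* one thread block per matrix row */
    int tid = threadIdx.x;             /* one thread per row element      */

    buf[tid] = (tid < row_len) ? matrix[row * row_len + tid] : 0.0f;
    __syncthreads();                   /* wait until all loads finish     */

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  /* tree reduction     */
        if (tid < s) buf[tid] += buf[tid + s];
        __syncthreads();
    }
    if (tid == 0) sums[row] = buf[0];  /* thread 0 stores the row sum     */
}

/* Host-side launch: a grid of 'rows' thread blocks, 128 threads per block,
 * and 128 * sizeof(float) bytes of shared memory per block.               */
void launch_row_sum(const float *d_matrix, float *d_sums, int rows, int row_len) {
    int threads = 128;
    row_sum<<<rows, threads, threads * sizeof(float)>>>(d_matrix, d_sums, row_len);
}
```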

Fig. 2 CUDA-C program structure and memory hierarchy [26]

GPU programming has established a new platform for evolutionary computation [9]. The majority of evolutionary algorithms, including genetic algorithms (GA) [28], genetic programming (GP) [21, 33], and DE [40, 42, 43], have been implemented on GPUs. Most of the contemporary implementations of evolutionary algorithms on GPUs map each candidate solution in the population to a single GPU thread. However, recent work in the area of evolutionary computation on GPUs has introduced further parallelization of GA by, e.g., many-threaded implementation of the crossover operator and local search [12].

3.1 Differential Evolution on the GPU

Due to the simplicity of its operations and real encoding of the candidate solutions, DE is suitable for parallel implementation on the GPUs. In DE, each candidate solution is represented by a vector of real numbers (parameters), and the population as a whole can be seen as a real-valued matrix. Moreover, both the mutation operator and the crossover operator can be implemented easily as straightforward vector operations.

The first implementation of DE on the CUDA platform was introduced in early 2010 by de Veronese and Krohling [40]. Their DE was implemented using the CUDA-C language, and it achieved speedups between 19 and 34 times compared to a CPU implementation on a set of benchmarking functions. The generation of random numbers was implemented using the Mersenne Twister from the CUDA SDK, and the selection of random vectors for mutation was done on the CPU.

Zhu [42], and Zhu and Li [43] implemented DE on CUDA as part of a differential evolution-pattern search algorithm for bound-constrained optimization problems and as part of a differential evolutionary Markov chain Monte Carlo method (DE-MCMC), respectively. In both cases, the performance of the algorithms was demonstrated on a set of continuous benchmarking functions.

The common property of the above DE implementations is the mapping of a single GPU thread to one candidate solution and the usage of the Mersenne Twister from the CUDA SDK for random number generation on the GPU. Moreover, in [42], some parts of the random number generation process were offloaded to the CPU. In this work, we use a new implementation of DE on the CUDA platform that uses many threads to process each candidate solution and utilizes the GPU to generate the random numbers needed for the optimization.
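Random numbers can be generated directly on the GPU, for example, with the cuRAND device API. The following sketch is our illustration of this idea, not the generator used by the implementations cited above: every thread owns a generator state and fills a batch of uniform random numbers in global memory.

```cuda
#include <curand_kernel.h>

/* One cuRAND generator state per thread, seeded once at the start of the run. */
__global__ void rng_setup(curandState *states, unsigned long long seed) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    curand_init(seed, id, 0, &states[id]);
}

/* Each thread writes 'per_thread' uniform random numbers from (0, 1] into a
 * batch in global memory; the batch can then be consumed by the DE kernels. */
__global__ void rng_batch(curandState *states, float *out, int per_thread) {
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    curandState local = states[id];    /* work on a register copy of the state */
    for (int k = 0; k < per_thread; k++)
        out[id * per_thread + k] = curand_uniform(&local);
    states[id] = local;                /* store the advanced state back        */
}
```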

Fig. 3 The flowchart of the DE implementation on CUDA

4 Many-Threaded Differential Evolution on the GPU

The goal of the implementation of DE on the CUDA platform was to achieve high parallelism while retaining the simplicity of the algorithm. The implementation consists of a set of CUDA-C kernels for the generation of the initial population, the generation of batches of random numbers for decision making, DE processing including the generation of trial vectors, mutation and crossover, the verification of the generated vectors, and the merger of parent and offspring populations. Besides these DE kernels, the fitness function evaluation was implemented in a separate kernel. The overview of the presented DE implementation is shown in Fig. 3. The kernels were implemented using the following principles:

  1.

    Each candidate solution is processed by a thread block (thread group). The number of thread groups is limited to \({({2}^{16} - 1)}^{3}\) in nVidia CUDA 4.0 and to \({({2}^{16} - 1)}^{2}\) in earlier versions. Hence, the maximum population size is limited to the same value.

  2.

    Each vector parameter is processed by a thread. The limit of threads per block depends in CUDA on the hardware compute capability, and it is 512 for compute capability 1.x and 1024 for compute capability 2.x [26]. This limit determines the maximum vector length, which is enough for the application areas considered in this chapter. The mapping of CUDA threads and thread blocks to the DE vectors is illustrated in Fig. 4.

  3.

    Each kernel call aims to process the whole population in a single step, e.g., it asks the CUDA runtime to launch M blocks with N threads in parallel. The CUDA runtime executes the kernel with respect to available resources.

Such an implementation brings several advantages. First, all the generic DE operations can be considered done in parallel, and, thus, their complexity reduces from M × N (population size multiplied by vector length) to c (constant, duration of the operation plus CUDA overhead). Second, this DE operates in a highly parallel way also on the logical level. A population of offspring chromosomes of the same size as the parent population is created in a single step and later merged with the parent population. Third, the evaluation of vectors is accelerated by the implementation of the fitness function on GPU.
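A hedged sketch of a mutation-and-crossover kernel under this mapping (one thread block per candidate vector, one thread per parameter) is shown below. It assumes that the indices of the randomly selected vectors and a batch of pre-generated random numbers have been prepared in global memory by previous kernels; all names are illustrative and do not reproduce the actual implementation.

```cuda
/* DE/rand/1/bin trial vector kernel: blockIdx.x selects the target vector,
 * threadIdx.x selects the parameter. pop and trial are M x N row-major
 * matrices in global memory, r1/r2/r3 hold pre-selected vector indices,
 * rnd holds pre-generated random numbers from [0, 1) (one per parameter),
 * and jrand[i] is the parameter index always taken from the mutant.       */
__global__ void de_trial_kernel(const float *pop, float *trial,
                                const int *r1, const int *r2, const int *r3,
                                const float *rnd, const int *jrand,
                                int n, float F, float C) {
    int i = blockIdx.x;    /* candidate (target) vector index */
    int m = threadIdx.x;   /* parameter index                 */
    if (m >= n) return;

    /* differential mutation, Eq. (2) */
    float mutant = pop[r1[i] * n + m] + F * (pop[r2[i] * n + m] - pop[r3[i] * n + m]);

    /* uniform (binomial) crossover, Eq. (4) */
    trial[i * n + m] = (rnd[i * n + m] < C || m == jrand[i]) ? mutant : pop[i * n + m];
}

/* Host-side launch processing the whole population in one call:
 * de_trial_kernel<<<M, N>>>(d_pop, d_trial, d_r1, d_r2, d_r3, d_rnd, d_jrand,
 *                           N, F, C);   with M blocks of N threads each.     */
```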

Fig. 4 The mapping of CUDA threads and thread blocks on DE population

5 The Performance of the Many-Threaded DE on the GPU

The performance of the many-threaded DE was evaluated on several test problems. To perform the experiments, the DE/rand/1/bin type of DE was implemented both for the CUDA platform and for sequential execution on the CPU. When not stated otherwise, the presented experiments were performed on a server with two dual-core AMD Opteron processors at 2.6 GHz and an nVidia Tesla C2050 with 448 cores at 1.15 GHz.

In contrast to the majority of DE applications, the many-threaded DE on the GPU was used to solve combinatorial optimization problems. However, to provide at least an indirect comparison with previous DE implementations on CUDA, the optimization of the test function \(f_{2}\) was also performed.

5.1 Function Optimization

Previous GPU-based DE implementations were most often tested using a set of continuous benchmarking functions. To provide at least a rough comparison of our approach to another DE variant, we have implemented a DE searching for the minimum of the test function \(f_{2}\) from [40]:

$$\displaystyle{ f_{2}(x) = \sum _{i=1}^{n}\left (x_{ i}^{2} - 10\cos (2\pi x_{ i}) + 10\right ) }$$
(9)
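A hedged sketch of how a fitness function like \(f_{2}\) can be evaluated under the block-per-candidate mapping of Sect. 4 (one thread per dimension, shared-memory reduction); the kernel is illustrative, assumes the number of threads per block is a power of two not smaller than n, and must be launched with blockDim.x × sizeof(float) bytes of dynamic shared memory.

```cuda
/* Evaluate f2 (Eq. (9)) for one candidate per thread block: pop is an M x n
 * row-major matrix and fitness receives one value per candidate. Assumes
 * blockDim.x is a power of two, blockDim.x >= n, and that the kernel is
 * launched with blockDim.x * sizeof(float) bytes of dynamic shared memory.  */
__global__ void f2_kernel(const float *pop, float *fitness, int n) {
    extern __shared__ float partial[];
    int i = blockIdx.x;                /* candidate index  */
    int m = threadIdx.x;               /* dimension index  */
    const float two_pi = 6.2831853f;

    float x = (m < n) ? pop[i * n + m] : 0.0f;
    partial[m] = (m < n) ? (x * x - 10.0f * cosf(two_pi * x) + 10.0f) : 0.0f;
    __syncthreads();

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {  /* sum over dimensions */
        if (m < s) partial[m] += partial[m + s];
        __syncthreads();
    }
    if (m == 0) fitness[i] = partial[0];
}
```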

We note that the comparison is indirect and rather illustrative since the two algorithms were executed for the same test function but with different dimensions, on different hardware, and with different settings.

The purpose of this comparison is to show whether the proposed many-threaded DE can find similar, better, or worse solutions of a previously used benchmark function.

Table 1 Comparison of the many-threaded DE with CUDA-C implementation from [40]

From Table 1, it can be seen that, within a very similar time frame, the many-threaded DE found a solution to \(f_{2}\) with a better (i.e., lower) fitness value. The many-threaded DE was executed on a more powerful GPU than the DE in [40], but it had to solve a function with a 1.28 times larger dimension. In 0.64 s, it delivered an approximately 1.19 times better solution (in terms of fitness value), and in 27 s it found an approximately three times better solution than the previous implementation.

This leads us to the conclusion that the proposed DE is able to find a good minimum of a continuous function and appears to be competitive with previous CUDA-C implementations.

5.2 Linear Ordering Problem

The linear ordering problem (LOP) is a well-known NP-hard combinatorial optimization problem. It has been intensively studied and there are plenty of exact, heuristic, and meta-heuristic algorithms for LOP. With its large collection of well-described testing data sets, the LOP represents an interesting testbed for meta-heuristic algorithms for combinatorial optimization [22, 23].

The LOP can be formulated as a graph problem [22]. For a complete directed graph \(D_{n} = (V _{n},A_{n})\) with weighted arcs \(c_{\mathit{ij}}\), compute a spanning acyclic tournament T in \(A_{n}\) such that \(\sum _{(i,j)\in T}c_{\mathit{ij}}\) is as large as possible. The LOP can also be defined as a search for an optimal column and row reordering of a weight matrix C [6, 22, 35, 36]. Consider a matrix \(C_{n\times n}\), a permutation Π, and a cost function f:

$$\displaystyle{ f(\varPi ) =\sum _{ i=1}^{n}\sum _{ j=i+1}^{n}c_{\varPi (i)\varPi (j)} }$$
(10)

The LOP is a search for a permutation Π so that f(Π) is maximized, i.e., the permutation restructures the matrix C so that the sum of its elements above the main diagonal is maximized. The LOP is an NP-hard problem with a number of applications in scheduling (scheduling with constraints), graph theory, economics, sociology (paired comparison ranking), tournaments, and archaeology, among others.
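A hedged sketch of the cost function f(Π) from Eq. (10) as a plain host function; the permutation is taken to be 0-based and the names are ours.

```cuda
/* Sum of the entries of C that end up above the main diagonal after the rows
 * and columns of C are reordered by the permutation pi (Eq. (10)). c is an
 * n x n row-major matrix and pi a 0-based permutation of 0..n-1.             */
long long lop_cost(const int *c, const int *pi, int n) {
    long long sum = 0;
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            sum += c[pi[i] * n + pi[j]];
    return sum;
}
```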

In economics, LOP algorithms are deployed to triangularize input-output matrices. The resulting permutation provides useful information on the stability of the investigated economy. In archaeology, LOP algorithms are used to process the Harris Matrix, a matrix describing the most probable chronological ordering of samples found in different archaeological sites [36]. Other applications of LOP include the equivalent graph problem, the related graph problem, the aggregation of individual preferences, ranking in sports tournaments, and crossing minimization [22].

A variant of the LOP is the linear ordering problem with cumulative costs (LOPCC) that has applications in the optimization of the universal mobile telecommunication standard (UMTS) in mobile phone telecommunication systems [4, 22, 30].

5.2.1 LOP Data Sets

There are several test libraries used for benchmarking LOP algorithms. They are well preprocessed and thoroughly described, and the optimal (or so-far best) solutions are available. The majority of the investigated algorithms were tested against the LOLIB library. The original LOLIB library contains 49 instances of input-output matrices describing European economies in the 1970s. Optimal solutions of the LOLIB matrices are available. Although the LOLIB contains real-world data, it is considered rather simple and easy to solve [31]. Mitchell and Borchers [24] have published an artificial LOP data library and a LOP instance generator. The data (MBLB) and code are available from Rensselaer Polytechnic Institute.

Schiavinotto and Stützle [35, 36] have shown that the LOLIB and MBLB instances are significantly different, having diverse high-level characteristics of the matrix entries such as sparsity or skewness. Their search space analysis revealed that MBLB instances typically have a higher correlation length and also a generally larger fitness-distance correlation than LOLIB instances, which suggests that MBLB instances should be easier to solve than LOLIB instances of the same dimension. Moreover, a new set of large artificial LOP instances (based on LOLIB) called XLOLIB was created and published. Another set of LOP instances is known as the Stanford GraphBase (SGB). The SGB is composed of larger input-output matrices describing the US economy.

Many LOP libraries are hosted by the Optsicom project. The Optsicom archive contains LOLIB, SGB, MBLB, XLOLIB, and other LOP instances. One important feature of the Optsicom LOP archive is that its LOP matrices are normalized [22], i.e., they were preprocessed so that:

  • All matrix entries are integral.

  • \(c_{ii} = 0\) for all \(i \in \{ 1,2,\ldots,n\}\).

  • \(\min \{c_{\mathit{ij}},c_{\mathit{ji}}\} = 0\) for all 1 ≤ i < j ≤ n.

5.2.2 LOP Algorithms

There are several exact and heuristic algorithms for the linear ordering problem. The exact algorithms are strongly limited by the fact that LOP is an NP-hard problem (i.e., there are no exact algorithms that could solve the LOP in polynomial time). Among the exact algorithms, a branch & bound approach based on an LP relaxation of the LOP for the lower bound, a branch & cut algorithm, and an interior point/cutting plane algorithm have attracted attention [36]. Exact algorithms are able to solve rather small general instances of the LOP and larger instances (with dimensions of a few hundred rows and columns) of certain classes of LOP [36].

A number of heuristic algorithms, including greedy algorithms, local search, elite tabu search, scatter search, and iterated local search, have been used to solve LOP instances [13, 22, 36]. In this work, we use the LOP as a testbed for the performance evaluation of a many-threaded DE implementation powered by the GPU.

For the purposes of DE, the LOP candidate solutions were represented using the random keys encoding [2]. With this encoding, a candidate solution consists of an array of real numbers, and the order of the array elements after sorting defines a permutation. Due to its simplicity, the random keys encoding is a natural choice for differential evolution of permutations.
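A hedged sketch of the random keys decoding: the indices of the candidate vector are sorted by their real-valued keys and the resulting order is used as the permutation; the helper structure and function names are ours.

```cuda
#include <stdlib.h>

/* Pair of (random key, original index) used for sorting. */
typedef struct { double key; int index; } keyed_t;

static int cmp_keys(const void *a, const void *b) {
    double ka = ((const keyed_t *)a)->key, kb = ((const keyed_t *)b)->key;
    return (ka > kb) - (ka < kb);
}

/* Decode a random-keys vector of length n into the permutation perm[0..n-1]:
 * sorting the keys yields the order in which the original indices appear.    */
void random_keys_decode(const double *keys, int *perm, int n) {
    keyed_t *tmp = (keyed_t *)malloc(n * sizeof(keyed_t));
    for (int i = 0; i < n; i++) { tmp[i].key = keys[i]; tmp[i].index = i; }
    qsort(tmp, n, sizeof(keyed_t), cmp_keys);
    for (int i = 0; i < n; i++) perm[i] = tmp[i].index;
    free(tmp);
}
```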

Computational experiments were performed on the server described at the beginning of this section. For comparison, the LOP was also computed on a laptop with an Intel Core i5 at 2.3 GHz. Although the CPUs used in the experiments were multicore, the LOP evaluation was single-threaded. The comparison of fitness computation times for different population sizes is illustrated in Fig. 5 (note the log scale). In the benchmark, a population of candidate solutions of a LOP instance with dimension 50 was evaluated on the GPU and on the two CPUs.

Fig. 5 The speed of LOP evaluation on the CPUs and on the GPU

We can see that the Core i5 is always 3.3–4.2 times faster than the Opteron. The Tesla C2050 is as fast as the Opteron when evaluating 16 LOP candidates and 1.4–9.7 times faster for larger LOP candidate populations. The Core i5 is faster than the GPU for population sizes 16, 32, and 64. The GPU performs similarly to the Core i5 when evaluating 128 LOP candidates and 1.4–2.4 times faster for larger populations.

5.2.3 Search for LOLIB Solutions on the GPU

The LOLIB solutions obtained by the DE on CUDA and by a sequential DE implementation are shown in Table 2. The table contains the best and average errors after 5 s and 10 s, computed from 10 independent optimization runs for each LOP matrix. We can see that the largest average error after 5 s is 0.232 for matrix t69r11xx, which is better than the best LOP solution found by the DE in [37]. The average error for all LOLIB matrices was 0.043 after 5 s and 0.031 after 10 s.

Table 2 The average error of LOLIB solutions (in percent)

We have compared the progress of DE for LOP on the GPU and on the CPU. A typical example of the optimization is shown in Fig. 6. Apparently, DE on the GPU delivers optimal or nearly optimal results very quickly compared to the CPU.

Interestingly, the results of the optimization after 10 s were sometimes worse than the results after 5 s (the results were obtained in separate program runs). This suggests that DE on CUDA quickly converges to a suboptimal solution but sometimes fails to find the global optimum. On the other hand, the algorithm found the global optimum in all test runs for 15 out of 49 LOP matrices, which is a good result for a pure meta-heuristic.

Fig. 6 Example of DE for LOLIB matrix be75eec on CPU and GPU

5.2.4 Search for N-LOLIB Solutions

DE on CUDA was also used to find solutions to the normalized LOLIB (N-LOLIB) library. The results of DE for LOP implemented on the GPU are shown in Table 3. The average error for all N-LOLIB matrices was 0.034 after 5 s and 0.037 after 10 s. Moreover, the DE has found the optimal solution for 18 out of 50 LOP instances.

When we compare these results to the N-LOLIB results obtained recently by several meta-heuristic methods in [22], we can see that differential evolution performs better than pure genetic algorithms, which reached an average error of 0.38 and found optimal results for 9 matrices. On the other hand, other meta-heuristics, including tabu search, memetic algorithms, and simulated annealing, performed better. However, we have to note that the DE used in this work is a pure meta-heuristic and uses no domain knowledge or local search to improve the solutions.

Table 3 The average error of N-LOLIB solutions (in percent)

5.3 Independent Task Scheduling

In grid and distributed computing, mixed-machine heterogeneous computing (HC) environments utilize a distributed suite of different machines to perform different computationally intensive applications that have diverse requirements [1, 5]. Task scheduling, i.e., the mapping of a set of tasks to a set of resources, is required to exploit the different capabilities of a set of heterogeneous resources. It is known that an optimal mapping of computational tasks to available machines in an HC suite is an NP-complete problem [11], and, as such, it is subject to various heuristic [5, 14, 25] and meta-heuristic [7, 27, 32, 41] algorithms, including differential evolution [18].

An HC environment is a composite of computing resources (PCs, clusters, or supercomputers). Let \(T =\{ T_{1},T_{2},\ldots,T_{n}\}\) denote the set of tasks that is submitted to a resource management system (RMS) in a specific time interval. Assume the tasks are independent of each other, with no inter-task data dependencies, and that preemption is not allowed (the tasks cannot change the resource they have been assigned to). Also assume that, at the time the RMS receives these tasks, m machines \(M =\{ M_{1},M_{2},\ldots,M_{m}\}\) are within the HC environment. For our purposes, scheduling is done on the machine level, and it is assumed that each machine uses the first-come, first-served (FCFS) method for performing the received tasks. We assume that each machine in the HC environment can estimate how much time is required to perform each task. In [5], the expected time to compute (ETC) matrix was used to estimate the time required for executing a task on a machine. An ETC matrix is an n × m matrix in which n is the number of tasks and m is the number of machines. One row of the ETC matrix contains the estimated execution times for a given task on each machine. Similarly, one column of the ETC matrix consists of the estimated execution times of a given machine for each task. Thus, for an arbitrary task \(T_{j}\) and an arbitrary machine \(M_{i}\), \([\mathit{ETC}]_{j,i}\) is the estimated execution time of \(T_{j}\) on \(M_{i}\). In the ETC model, we make the usual assumption that we know the computing capacity of each resource, an estimation or prediction of the computational needs of each job, and the load of prior work of each resource.

The two objectives to optimize during the task mapping are makespan and flowtime. Optimum makespan (meta-task execution time) and flowtime of a set of jobs can be defined as:

$$\displaystyle\begin{array}{rcl} \mathit{makespan}& =& \min _{S\in \mathit{Sched}}\{\max _{j\in \mathit{Jobs}}F_{j}\}{}\end{array}$$
(11)
$$\displaystyle\begin{array}{rcl} \mathit{flowtime}& =& \min _{S\in \mathit{Sched}}\{\sum _{j\in \mathit{Jobs}}F_{j}\}{}\end{array}$$
(12)

where Sched is the set of all possible schedules, Jobs stands for the set of all jobs to be scheduled, and \(F_{j}\) represents the time at which job j finalizes. Assume that \(C_{\mathit{ij}}\) \((j = 1,2,\ldots,n,\ i = 1,2,\ldots,m)\) is the completion time for performing the jth task on the ith machine and \(W_{i}\) \((i = 1,2,\ldots,m)\) is the previous workload of \(M_{i}\); then \(\sum _{j\in S(i)}C_{\mathit{ij}} + W_{i}\) is the time required for \(M_{i}\) to complete the tasks assigned to it (S(i) is the set of jobs scheduled for execution on \(M_{i}\) in schedule S). According to the aforementioned definition, makespan and flowtime can be evaluated using:

$$\displaystyle\begin{array}{rcl} \mathit{makespan}& =& \max _{i\in \{1,2,\ldots,m\}}\{\sum _{j\in S(i)}C_{\mathit{ij}} + W_{i}\}{}\end{array}$$
(13)
$$\displaystyle\begin{array}{rcl} \mathit{flowtime}& =& \sum _{i=1}^{m}\sum _{ j\in S(i)}C_{\mathit{ij}}{}\end{array}$$
(14)

Minimizing makespan aims to execute the whole meta-task as fast as possible while minimizing flowtime aims to utilize the computing environment efficiently.

A schedule of n independent tasks executed on m machines can be naturally expressed as a string of n integers \(S = (s_{1},s_{2},\ldots,s_{n})\) subject to \(s_{i} \in \{ 1,\ldots,m\}\). The value at the ith position in S represents the machine on which the ith job is scheduled in schedule S. Since differential evolution encodes problems as real vectors, real coordinates must be used instead of discrete machine numbers. The real-encoded DE vector is translated to the schedule representation by simple truncation of its coordinates (e.g., 3.6 → 3, 1.2 → 1). Assume a schedule S from the set of all possible schedules Sched. For the purpose of differential evolution, we define a fitness function \(\mathit{fit}(S): \mathit{Sched} \rightarrow \mathbb{R}\) that evaluates each schedule:

$$\displaystyle{ \mathit{fit}(S) = \lambda \cdot \mathit{makespan}(S) + (1-\lambda ) \cdot \frac{\mathit{flowtime}(S)} {m} }$$
(15)

The function fit(S) is a sum of two objectives: the makespan of schedule S and the flowtime of schedule S divided by the number of machines m, which keeps both objectives at approximately the same magnitude. The influence of makespan and flowtime in fit(S) is parametrized by the variable λ. The same schedule evaluation was used also in [7].
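A hedged sketch of this schedule evaluation as a plain host function: the real-encoded DE vector is truncated to (here 0-based) machine indices and the fitness of Eq. (15) is computed from the completion times given by the ETC matrix; the names, and the assumption that the previous workloads \(W_{i}\) are supplied by the caller, are ours.

```cuda
#include <stdlib.h>

/* Evaluate a real-encoded schedule according to Eq. (15). etc is an n x m
 * row-major ETC matrix (n tasks, m machines), w holds the previous workload
 * of each machine, and lambda weights makespan against flowtime.            */
double schedule_fitness(const double *vec, const double *etc, const double *w,
                        int n, int m, double lambda) {
    double *machine_time = (double *)calloc(m, sizeof(double));
    double flowtime = 0.0, makespan = 0.0;

    for (int j = 0; j < n; j++) {
        int i = (int)vec[j];                 /* truncation: 3.6 -> machine 3  */
        if (i < 0) i = 0; else if (i >= m) i = m - 1;
        machine_time[i] += etc[j * m + i];   /* completion time C_ij          */
        flowtime += etc[j * m + i];          /* Eq. (14)                      */
    }
    for (int i = 0; i < m; i++)              /* Eq. (13): slowest machine     */
        if (machine_time[i] + w[i] > makespan) makespan = machine_time[i] + w[i];

    free(machine_time);
    return lambda * makespan + (1.0 - lambda) * flowtime / m;   /* Eq. (15)   */
}
```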

5.3.1 Search for Optimal Independent Task Schedules

We have implemented the DE for scheduling of independent tasks on the CUDA platform to evaluate the performance and quality of the proposed solution. The GPU implementation was compared to a simple CPU implementation (high-level object-oriented C++ code) and an optimized CPU implementation (low-level C code written to achieve maximum performance). The optimized CPU implementation was created to provide a fair comparison of performance-oriented implementations on the GPU and on the CPU. The optimized CPU and GPU implementations of the DE for scheduling optimization were identical except for the CUDA-C language constructs.

First, the time needed to compute the fitness of the population of DE vectors was measured for all three DE implementations. The comparison of the fitness computation times for different population sizes is illustrated in Fig. 7 (note the log scale of both axes).

The GPU implementation was 25.2–216.5 times faster than the CPU implementation and 2.2–12.5 times faster than the optimized CPU implementation of the same algorithm. This, along with the speedup achieved by the parallel implementation of the DE operations, contributes to the overall performance of the algorithm.

Fig. 7 Comparison of schedule evaluation time on CPU and GPU

To compare the GPU based and the CPU based DE implementations for the independent task scheduling, we have used the benchmark proposed in [5]. The benchmark contains several ETC matrices for 512 jobs and 16 machines. The matrices are labeled according to the following properties [5]:

  • Task heterogeneity—\(V_{\mathit{task}}\) represents the amount of variance among the execution times of tasks for a given machine.

  • Machine heterogeneity—\(V_{\mathit{machine}}\) represents the variation among the execution times for a given task across all the machines.

  • Consistency—an ETC matrix is said to be consistent whenever, if a machine \(M_{j}\) executes any task \(T_{i}\) faster than machine \(M_{k}\), then machine \(M_{j}\) executes all tasks faster than machine \(M_{k}\).

  • Inconsistency—machine \(M_{j}\) may be faster than machine \(M_{k}\) for some tasks and slower for others.

Table 4 The fitness of best schedule found in 30 s using different population sizes (lower is better)
Fig. 8 Fitness improvement of different DE implementations for ThM*C* matrices. (a) ThMhCc, (b) ThMhCi, (c) ThMhCs, (d) ThMlCc, (e) ThMlCi, (f) ThMlCs

Fig. 9 Fitness improvement of different DE implementations for TlM*C* matrices. (a) TlMhCc, (b) TlMhCi, (c) TlMhCs, (d) TlMlCc, (e) TlMlCi, (f) TlMlCs

Each ETC matrix is named using the pattern TxMyCz, where x describes task heterogeneity (high or low), y describes machine heterogeneity (high or low), and z describes the type of consistency (inconsistent, consistent, or semi-consistent).

We have investigated the speed and the quality of the results obtained by the proposed DE implementation and compared them to the results obtained by the CPU implementations. The average fitness values of the best schedules found by different DE variants after 30 s are listed in Table 4. The best results for each ETC matrix are shown in bold. We can see that the GPU implementation delivered the best results for population sizes 1024 and 512. However, the most successful population size was 64; such a population size appears to be suitable for the investigated scheduling problem with the given dimensions (i.e., number of jobs and number of machines). When executing differential evolution with population size 64, the optimized CPU implementation delivered the best results for the consistent ETC matrices, i.e., ThMhCc, ThMlCc, TlMhCc, and TlMlCc. In all other cases, the best result was found by the GPU-powered differential evolution.

The progress of DE with the most successful population size 64 for different ETC matrices is shown in Figs. 8 and 9. The figures clearly illustrate the large difference between DE on the CPU and on the GPU. DE executed on the GPU achieves the most significant fitness improvement during the first few seconds (roughly 5 s), while the CPU implementations require much more time to deliver solutions of similar quality, if they manage to do so at all. Needless to say, the optimized CPU implementation always found better solutions than the simple CPU implementation because it managed to process more candidate vectors in the same time frame.

6 Conclusions

This chapter described the design and implementation of a fine-grained DE on the GPUs. The basic steps of the algorithm were implemented with respect to the super parallel SIMD architecture of the GPUs allowing efficient parallel execution of hundreds of threads. The fine-grained many-threaded DE design was chosen in order to maximize the utilization of the resources provided by the GPUs.

The performance of the solution was demonstrated on a series of computational experiments. The experimental evaluation involved continuous function optimization to provide a rough comparison with a previous DE design for the GPU and two popular combinatorial optimization problems with real-world applications. The many-threaded DE implemented on the CUDA platform has shown good performance and good results in all test cases.