1 Introduction

The increasing computational demand of next-generation applications has driven computer designers to adopt new approaches to designing and constructing large High Performance Computing (HPC) platforms, sparking the development and deployment of new technologies. These technologies include multicore and many-core architectures, such as GPUs or modern Xeon Phi platforms, as well as multi-GPU clusters, to speed up algorithms with high computational requirements.

The use of evolutionary multiobjective optimization (EMO) algorithms to solve large-scale multiobjective problems has been limited by their large computational burden. However, thanks to HPC techniques, these kinds of algorithms can be applied to multiobjective problems with many objectives and/or a large number of variables within a reasonable amount of time and with reduced energy consumption. Examples of such applications are found, e.g., in finance (large-scale asset allocation problems) [22], in which thousands or even tens of thousands of variables must be balanced. As a consequence, large populations must be used to approximate the Pareto front. Some studies in which HPC techniques are developed to solve large-scale EMO problems are [10, 14, 21, 28].

Parallel implementations of EMO algorithms commonly focus on distributing the evaluation of the objective functions. Leaving aside the computation of the objective functions, Non-Dominated Sorting (NDS) is the most time-consuming procedure in the majority of EMO algorithms based on the Pareto dominance ranking principle. Well-known EMO algorithms that use this procedure include NSGA-II [5], SPEA2 [31], PAES [16], R-NSGA-II [6], Synchronous NSGA-II [8], NSGA-III [4] and EPCS [24].

Sequential NDS has been extensively studied. The first NDS version was proposed in [26]. It is based on a brute-force method for finding the Pareto sets and has \({\mathcal {O}}(MN^3)\) complexity, because comparisons are repeated without auxiliary structures that store dominance information about the individuals. In [5], together with the popular NSGA-II algorithm for EMO, the Fast Non-Dominated Sorting (FNDS) was introduced with \({\mathcal {O}}(MN^2)\) complexity. This sequential method maintains, for every individual, a domination count variable and the set of solutions it dominates. It requires \(N^2\) comparisons because all individuals are compared with each other. However, it is possible to classify the individuals into fronts while avoiding unnecessary comparisons. In this line, several improvements have been implemented by developing more efficient sorting strategies [7, 11, 23, 27, 29, 30]. It must be noted that the computational burden of these sequential versions of the NDS procedure remains \({\mathcal {O}}(MN^2)\) in the worst case.
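
As an illustration of this bookkeeping, a minimal sequential sketch of FNDS could be written as follows (C++; the container layout and the helper name dominates are our own assumptions, not taken from the code of [5]):

#include <vector>

// Objective vectors of the population: N individuals x M objectives (minimization).
using Population = std::vector<std::vector<double>>;

// a dominates b: not worse in any objective and strictly better in at least one.
static bool dominates(const std::vector<double>& a, const std::vector<double>& b) {
    bool strictly_better = false;
    for (std::size_t m = 0; m < a.size(); ++m) {
        if (a[m] > b[m]) return false;
        if (a[m] < b[m]) strictly_better = true;
    }
    return strictly_better;
}

// Returns the rank (front index, starting at 0) of every individual.
static std::vector<int> fast_non_dominated_sort(const Population& P) {
    const int N = static_cast<int>(P.size());
    std::vector<int> n(N, 0);                 // domination count of each individual
    std::vector<std::vector<int>> S(N);       // set of individuals dominated by each one
    std::vector<int> rank(N, -1);
    std::vector<int> front;                   // current front
    for (int p = 0; p < N; ++p)               // N^2 pairwise comparisons fill n and S
        for (int q = 0; q < N; ++q) {
            if (p == q) continue;
            if (dominates(P[p], P[q])) S[p].push_back(q);
            else if (dominates(P[q], P[p])) ++n[p];
        }
    for (int p = 0; p < N; ++p)
        if (n[p] == 0) { rank[p] = 0; front.push_back(p); }
    for (int r = 0; !front.empty(); ++r) {    // peel off the remaining fronts
        std::vector<int> next;
        for (int p : front)
            for (int q : S[p])
                if (--n[q] == 0) { rank[q] = r + 1; next.push_back(q); }
        front.swap(next);
    }
    return rank;
}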

The divide-and-conquer strategy proposed by Jensen reduced the complexity to \({\mathcal {O}}(N \log ^{M-1}N)\) [15]; however, this algorithm is not applicable to many instances of EMO problems. In [9], the Jensen algorithm was extended to remove its limitation that no two solutions can share identical values for any of the problem's objectives, and the slight modification of [2] maintains the \({\mathcal {O}}(N \log ^{M-1} N)\) worst-case running time. However, the divide-and-conquer approach has longer processing times as the number of objectives increases. Recently, several efficient NDS algorithms have been proposed. In [11], the Efficient Non-Dominated Sort with Non-Dominated Tree (ENS-NDT) was introduced. It starts with the population ordered by the first objective and uses a novel Non-Dominated Tree (NDTree) to speed up the NDS by reducing unnecessary comparisons. In [23], an efficient NDS method referred to as Best Order Sort (BOS) was proposed. It begins by ordering the population by all objectives; then, the fronts are built while avoiding unnecessary comparisons. It outperforms previous proposals when sorting large populations with many objectives. BOS has also been used to define a hybrid NDS in combination with a divide-and-conquer approach [17].

However, the improved sequential NDS versions do not cover the computational needs of large-scale EMO problems. Therefore, parallel NDS routines should be developed. In [10], an NSGA-II parallel implementation on a GPU, focusing on the acceleration of NDS, was analysed. However, this GPU NDS version has the highest complexity, \({\mathcal {O}}(MN^3)\): every thread computes the dominance of its individual in parallel without considering dominance information about the other individuals. Parallel implementations of NDS (with \({\mathcal {O}}(MN^2)\) complexity) on multicore CPUs and GPUs were analysed in [21]. An efficient parallel version of the NDS procedure was formally presented in [25]; the dominance information of the individuals is stored in a matrix, but its experimental analysis is very limited. In [19], the same concept is applied to another GPU version of NDS, based on a data structure that stores the dominance information, where the individuals that dominate the population are computed using fast shuffled reductions of dominance matrices on modern GPUs. Therefore, the GPU versions of NDS analysed in the literature are based on the algorithms with the highest computational costs. Moreover, parallel versions of the most efficient sequential NDS algorithms have not been developed, since their data dependencies require an adaptation of the algorithms.

It is remarkable that most NDS schemes that reduce unnecessary comparisons become inherently sequential algorithms. In this line, to our knowledge, the fast state-of-the-art Best Order Sort (BOS) algorithm, referred to above, makes use of fast implementations of sorting algorithms and removes unnecessary comparisons among individuals. The result is an NDS that is efficient in terms of runtime, but whose structure prevents its parallel execution. With the goal of optimizing both the performance and the energy consumption of NDS with respect to the efficient BOS algorithm, two parallel NDS algorithms, MC-BOS and GPU-BOS, are introduced in this work. MC-BOS ranks populations in parallel on multicore processors and GPU-BOS solves the same problem on GPUs.

The proposed algorithms start by ordering the population according to the different objectives with a fast parallel routine; the fronts are then built from two ideas: (1) if an individual is not dominated by any solution already assigned to a particular rank, then it can be placed in that rank; and (2) if an individual is dominated by at least one individual in every existing rank, then it defines a new front of higher rank. Under these principles, the number of comparisons can be reduced because the population is ordered by the objectives.

The contribution of this work is twofold. First, new parallel implementations of the NDS procedure on multicore processors and GPUs, referred to as MC-BOS and GPU-BOS respectively, are analysed. Second, an experimental evaluation of MC-BOS and GPU-BOS is carried out using modern multicore processors and GPUs to rank populations of different sizes and numbers of objectives. The runtime and energy consumption are analysed in relation to the sequential BOS. For all tests, both parallel algorithms accelerate the NDS and reduce its energy consumption with respect to the sequential BOS by significant margins. The analysis of the obtained results allows us to identify the best platform/algorithm for the parallel NDS according to every problem size.

The paper is organized as follows: Sect. 2 is devoted to describing relevant concepts related to EMO problems and to BOS, a state-of-the-art NDS algorithm. Section 3 explains in detail the parallel implementations of BOS introduced in this work, MC-BOS and GPU-BOS. An experimental study of the performance and energy consumption of both parallel implementations and the sequential BOS is carried out in Sect. 4. The analysis of the experimental results allows us to define criteria to choose the optimal version/platform in terms of performance and energy consumption. Finally, Sect. 5 presents the conclusions of this work.

2 Background

2.1 Evolutionary multiobjective optimization

We can formulate a multiobjective minimization problem as follows [18]:

$$\begin{aligned} \min _{\mathbf {x} \in \mathbf {S}} \mathbf {f}(\mathbf {x}) = [f_1(\mathbf {x}), f_2(\mathbf {x}), \ldots , f_M(\mathbf {x})]^T \end{aligned}$$
(1)

where \(\mathbf {z} = \mathbf {f}(\mathbf {x})\) is an objective vector, containing the values of all objective functions \(f_1(\mathbf {x}), f_2(\mathbf {x}), \dots , f_M(\mathbf {x})\), with \(f_i: {\mathbb {R}}^V \rightarrow {\mathbb {R}}\), \(i \in \{1, 2, \dots , M\}\); \(M \ge 2\) is the number of objective functions; \(\mathbf {x} = (x_1, x_2, \dots , x_V)\) is a vector of variables (decision vector) and V is the number of variables; and \(\mathbf {S} \subset {\mathbb {R}}^V\) is the search space, which defines all feasible decision vectors.

A decision vector \(\mathbf {x}' \in \mathbf {S}\) is a Pareto-optimal solution if there is no decision vector \(\mathbf {x} \in \mathbf {S}\) such that \(f_i(\mathbf {x}) \le f_i(\mathbf {x}')\) for all \(i \in \{1, 2, \dots , M\}\) and \(f_j(\mathbf {x}) < f_j(\mathbf {x}')\) for at least one \(j \in \{1, 2, \dots , M\}\). The set of all Pareto-optimal solutions is called the Pareto set. An objective vector \(\mathbf {f}(\mathbf {x}')\) is a Pareto-optimal vector if \(\mathbf {x}'\) is a Pareto-optimal solution. The region defined by all the Pareto-optimal vectors is called the Pareto front.

For two objective vectors \(\mathbf {z}\) and \(\mathbf {z}'\), \(\mathbf {z}'\) dominates \(\mathbf {z}\) (written \(\mathbf {z}' \succ \mathbf {z}\)) if \(z_i' \le z_i\) for all \(i \in \{1, 2, \dots , M\}\) and there exists at least one \(j \in \{1, 2, \dots , M\}\) such that \(z_j' < z_j\). In EMO algorithms, the subset of solutions in a population whose objective vectors are not dominated by any other objective vector is called the non-dominated set, and its objective vectors are called the non-dominated objective vectors. The main aim of EMO algorithms is to generate well-distributed non-dominated objective vectors as close as possible to the Pareto front. Many EMO algorithms have been designed, and various operators, fitness functions and chromosomal representations can be found in the literature; however, the general outline remains similar. As a rule, the solution process of an EMO algorithm is iterative. It starts with an initial population of decision vectors randomly generated in the search space. Each iteration of an EMO algorithm consists of the following operations:

  • Evaluating each individual.

  • Assigning fitness to each individual.

  • Checking whether the termination condition is satisfied.

  • Modifying population using selection, mutation and crossover operators.

  • Creating a new population.

The solution process continues until a stopping criterion is satisfied, which is usually based on a maximum number of iterations or of function evaluations. Leaving aside the evaluation operation, in which the objective functions are computed for every individual, the most computationally expensive part of such algorithms is the dominance ranking operation, which is performed during the fitness assignment in each iteration of an EMO algorithm. In algorithms based on Pareto dominance, this ranking is implemented by the NDS procedure.

2.2 Non-Dominated Sorting

Non-Dominated Sorting assigns a rank to every individual, dividing the population into several non-dominated levels (fronts). According to Pareto dominance, the individuals with the same rank are non-dominated among themselves and they can only be dominated by solutions of a lower rank. It should be noted that the dominance comparisons between individuals are repeated in every iteration of an EMO algorithm and account for most of its computational burden.

The Best Order SortFootnote 1 (BOS) algorithm for NDS was introduced and analysed in [23]. It is considered one of the most efficient NDS algorithms in terms of practical performance [17]. In this work, the principles of BOS are used to develop parallel NDS procedures with a reduced number of comparisons.

Algorithm 1 describes the BOS procedure to compute the fronts of a population P with N individuals for a problem with M objectives. It consists of two stages. In the first one, the global data structures are initialized and the population is sorted into the ordered sets \(Q_j\) by objectives \(j = 1, \ldots , M\), using lexicographical ordering in case of a tie. This way, the sets \(Q_j\) can be considered the columns of a matrix, referred to as Q, which is computed in the first stage; \(Q_{ij}\) represents the i-th individual in the list sorted by objective j. SC and RC hold the number of ranked individuals and the number of fronts computed so far, respectively. The sets \(L_j^r\), for \(1 \le j \le M\) and \(1 \le r \le N\), contain the subset of front r identified through the analysis of objective j. When BOS finishes, front r is the union of the sets \(L_j^r\) over \(1 \le j \le M\). Moreover, two vectors of length N, F and isRanked, are defined to store the rank of every individual and to mark the ranked individuals, respectively.

[Algorithm 1]
[Algorithm 2]

In the second stage, the individuals are ranked by dominance comparisons. The process starts with the individuals that have the best objective values in the sorted sets \(Q_j\). Every individual is checked in the corresponding objective, and the sets \(L_j^r\) are filled to build the fronts and record the individuals ranked through comparisons in objective j. If an individual s is being checked for objective j and it had previously been ranked through the analysis of another objective, then it is simply added to the set \(L_j^{F_s}\). If the individual s had not been previously ranked, then the routine ranks it by the dominance analysis in objective j, as described in Algorithm 2. This way, s is compared to the individuals t in \(L_j^{k}\) for the computed ranks \(1 \le k \le RC\), in increasing order of k. If no individual in \(L_j^{k}\) dominates s, then s is classified in front k, that is, \(F_s = k\), and it is added to \(L_j^{F_s}\). If the comparisons finish and s is dominated in every computed front, then s defines a new front of higher rank.
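
A compact sequential sketch of this per-objective ranking step (our reading of Algorithm 2; variable names and the 0-based front indexing are ours, and the real listing should be consulted in [23]) is:

#include <vector>

// Called while traversing the sorted set Q_j.  pop holds the objective vectors
// (N x M, minimization).  L[r] plays the role of L_j^r: individuals of front r
// already ranked while scanning objective j.  RC is the number of fronts so far.
// Returns the rank assigned to individual s.
static int rank_by_objective(int s,
                             const std::vector<std::vector<double>>& pop,
                             std::vector<std::vector<int>>& L, int& RC) {
    for (int r = 0; r < RC; ++r) {                  // check existing fronts in order
        bool dominated = false;
        for (int t : L[r]) {                        // t precedes s in Q_j, so t is never
            bool dom = true, strict = false;        // worse than s in objective j
            for (std::size_t m = 0; m < pop[t].size(); ++m) {
                if (pop[t][m] > pop[s][m]) { dom = false; break; }
                if (pop[t][m] < pop[s][m]) strict = true;
            }
            if (dom && strict) { dominated = true; break; }
        }
        if (!dominated) { L[r].push_back(s); return r; }   // s joins front r
    }
    if (static_cast<int>(L.size()) <= RC) L.resize(RC + 1);
    L[RC].push_back(s);                             // s is dominated in every computed
    return RC++;                                    // front: it opens a new front
}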

Thus, Algorithm 1 reduces the comparisons needed to compute the NDS as it is based on: (1) the prior sorting of the population by objectives, so that only the individuals that are 'not worse' in the corresponding objective need to be compared; and (2) the sorted checking, which ranks first the individuals with 'better' objective values.

The computational complexity of BOS in the worst case is \({\mathcal {O}}(MN^2)\); however, it is relevant to underline that BOS reduces the number of comparisons at the expense of increasing its data dependencies. Therefore, it is inherently a sequential algorithm. Nevertheless, in the next sections, several modifications of the BOS algorithm are studied with the goal of defining parallel versions of BOS which can exploit modern parallel architectures and remain competitive with respect to the efficient sequential BOS.

3 Parallel implementations of the Non-Dominated Sorting based on Best Order Sort algorithm

Most HPC platforms, and also modern desktop computers, are composed of multicore processors and GPU devices. CUDA (Compute Unified Device Architecture) is the parallel interface introduced by NVIDIA to help develop GPU codes using the C or C++ language. CUDA provides some abstraction of the GPU hardware and offers the SIMT (Single Instruction, Multiple Threads) programming model to exploit the GPU. However, the programmer has to take into account several features of the architecture, such as the topology of the multiprocessors and the management of the memory hierarchy. During the execution of a program, the CPU (called host in CUDA) performs a succession of parallel routine (kernel) invocations on the device. The input/output data to/from the GPU kernels are transferred between the CPU memory and the 'global' GPU memory. GPUs have hundreds of cores which can collectively run thousands of computing threads. Each core, called a Scalar Processor (SP), belongs to a set of multiprocessor units called Streaming Multiprocessors (SM). The SMs are composed of 192 (or 128) SPs on the Kepler (or Maxwell) GPU architectures [12, 20]. This way, the GPU device consists of a set of SMs, and each kernel is executed as a batch of threads organized as a grid of thread blocks [1].
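
For readers unfamiliar with this model, the following minimal CUDA sketch (illustrative only, not part of the proposed codes) shows the host/device pattern just described: data is copied to the global GPU memory, a kernel is launched as a grid of thread blocks, and the results are copied back.

#include <cuda_runtime.h>

__global__ void scale(float *x, float alpha, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) x[i] *= alpha;                        // one element per thread
}

void scale_on_gpu(float *host_x, int n) {
    float *dev_x;
    cudaMalloc((void **)&dev_x, n * sizeof(float));
    cudaMemcpy(dev_x, host_x, n * sizeof(float), cudaMemcpyHostToDevice);
    int threadsPerBlock = 256;
    int blocksPerGrid = (n + threadsPerBlock - 1) / threadsPerBlock;  // grid of thread blocks
    scale<<<blocksPerGrid, threadsPerBlock>>>(dev_x, 2.0f, n);
    cudaMemcpy(host_x, dev_x, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev_x);
}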

3.1 Multicore version of the Best Order Sort algorithm (MC-BOS)

The multicore version is implemented in C with PthreadsFootnote 2 to exploit the parallelism available on modern processors. The original BOS algorithm loops through the Q matrix row-wise, trying to reduce the number of comparisons needed to rank each individual. However, each set \(Q_j\) has its own \(L_j\) structure, so the objectives can be processed in parallel without synchronization.

Algorithm 3 describes the operation of each of the M compute threads. The initial sorting of the population by each objective is efficiently computed by each thread using the sorting routineFootnote 3 with a custom comparison function that considers multiple objectives in case of a tie. Every thread j ranks the population in the order defined by \(Q_j\) and writes the rank of every individual in the shared data structure F (line 9, Algorithm 3). Thus, every thread only ranks the individuals which have not already been processed by another thread.

[Algorithm 3]

Although avoiding synchronization points increases the performance of the multicore algorithm, it may cause a 'write after write' data hazard on the shared array F (which contains the ranks of the population) when several threads try to rank the same individual at the same time. Nevertheless, the definition of domination given in Sect. 2 guarantees that the individuals that dominate a given individual occupy lower positions in every sorted set \(Q_j\). Since the rank of an individual is the maximum rank of the individuals that dominate it plus one, every thread computes the same rank for a given individual, and this hazard does not cause wrong results.
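
A minimal sketch of the thread layout just described (one Pthread per objective and a shared rank array F written without locks) might look as follows; the names, the structure layout and the placeholder value of M are ours, not the actual MC-BOS source:

#include <pthread.h>
#include <stdlib.h>

#define M 10          /* number of objectives = number of compute threads (example value) */

typedef struct {
    int objective;    /* index j of the objective this thread scans                    */
    int n;            /* population size                                               */
    int *F;           /* shared rank array: benign write-after-write hazard (see text) */
    /* ... pointers to the population and to the Q_j, L_j structures would go here ... */
} thread_arg_t;

/* Hypothetical per-thread routine: sort by objective j, then rank following Q_j. */
static void *mc_bos_thread(void *p) {
    thread_arg_t *arg = (thread_arg_t *)p;
    /* sort_population_by_objective(arg);                                          */
    /* rank_following_Qj(arg);   -- writes the computed ranks into arg->F          */
    (void)arg;
    return NULL;
}

void mc_bos(int n, int *F) {
    pthread_t threads[M];
    thread_arg_t args[M];
    for (int j = 0; j < M; ++j) {
        args[j].objective = j; args[j].n = n; args[j].F = F;
        pthread_create(&threads[j], NULL, mc_bos_thread, &args[j]);
    }
    for (int j = 0; j < M; ++j)
        pthread_join(threads[j], NULL);   /* the final join is the only synchronization point */
}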

3.2 GPU implementation of NDS based on Best Order Sort (GPU-BOS)

[Algorithm 4]

The parallel scheme of MC-BOS is not appropriate to exploit the massive parallelism provided by a GPU. As a consequence, it is necessary to design a new scheme with specific data structures that increase the parallelism level of the algorithm. GPU-BOS has been designed by adapting the key ideas of BOS to a massively parallel architecture.

Algorithm 4 shows the host pseudocode of GPU-BOS to compute the NDS on the GPU. It includes three phases. In Phase 1, the global parameters are defined, the population data structure is transferred from the CPU to the GPU memory and then it is efficiently sorted by objectives on the GPU. The sorting is computed on the GPU by M executions of the sorting kernel provided by the CUBFootnote 4 library, launched in streaming mode.

In Phase 2, for every individual s, the sub-population which could dominate it is obtained from the matrix Q, whose M columns contain the population sorted by each objective. Every individual s is located in each column \(Q_j\). Then, the lowest row of Q which stores s, \(\mathbf{i}\), is obtained, and the corresponding column is denoted \(\mathbf{j}\). Thus, \(\mathbf{j}\) represents the objective with the best order for the individual s and \(\mathbf{i}\) is the index of s in the population sorted by that objective. As justified in [23], the sub-population which could dominate the individual s is defined by the array \(S_s = [ Q_{0,\mathbf{j}}, \ldots , Q_{\mathbf{i}-1, \mathbf{j}}]\).

Thus, the matrix B has dimensions \(2 \times N\) and stores the mentioned pair of indexes for every individual; it is computed in parallel on the GPU by the kernels invoked in lines 7 and 8 of Algorithm 4. One of these kernels also contains a sub-routine to include in the set \(S_s\) of each individual s the individuals that occupy higher positions but have the same value in the studied objective function; this allows the use of a faster sorting algorithm that does not need lexicographical ordering in case of ties.
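
A possible kernel for this step is sketched below. As an illustration only, it assumes that an auxiliary array pos of size M·N, holding the row of each individual in every sorted column \(Q_j\), has been filled during the sorting phase; the actual GPU-BOS kernels may organize the data differently.

// B is stored as two arrays of length N: Brow[s] = i (best position) and
// Bcol[s] = j (objective giving that position).  One thread per individual.
__global__ void compute_B(const int *pos,          // pos[j * N + s]: row of s in Q_j
                          int *Brow, int *Bcol, int N, int M) {
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= N) return;
    int best_i = pos[s];                            // objective 0
    int best_j = 0;
    for (int j = 1; j < M; ++j) {
        int i = pos[j * N + s];
        if (i < best_i) { best_i = i; best_j = j; }
    }
    Brow[s] = best_i;                               // s can only be dominated by the
    Bcol[s] = best_j;                               // individuals Q[0..best_i-1][best_j]
}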

Then, Phase 3 iteratively computes a new front until the entire population is ranked. The CuNewFront kernel (Algorithm 5) computes a new front in parallel at every iteration. N blocks of threads are defined, and every block analyses whether the corresponding individual s is classified in the new front. This analysis is based on the following property [23]: let s, \(D_s\) and \(r-1\) be an individual, the set of dominators of that individual and the maximal rank of those dominators, respectively; then the rank of s is r.

In the CuNewFront kernel, every thread block analyses whether its individual s belongs to the new front. When s has been previously ranked, the block stops. Otherwise, if the first front is being analysed, every block checks the dominance of s by comparisons against the list \(S_s\). So, iteratively, every batch of blockDim individuals is analysed in parallel. When one dominator is identified, the thread block stops checking and the threads that have identified dominators save that information in column s of the matrix \(\varDelta \). Also, the first thread of each block stores in \(batch_s\) the index of the next batch of individuals to be checked. When a front of \(Rank > 0\) is studied, every thread in block s reads the dominators previously identified from column s of \(\varDelta \). If at least one of them has not been ranked, then its rank will be \(q \ge Rank\), so the rank of s is higher than Rank; therefore s is not classified in the new front and the thread block stops.

Otherwise, if all of those dominators have been classified in a front \(q < Rank\), then the thread block checks the dominance of s against an initial sorted set of candidate dominators of s, \(S_s^* \equiv [ Q_{k,\mathbf{j}}, \ldots , Q_{\mathbf{i} -1, \mathbf{j}} ]\), where \(k = B_{0,\mathbf{j}}\). When the dominance is checked by a thread block, this initial set can include individuals previously ranked, which are removed from \(S_s^*\). If at least one new dominator of s is identified, then it is saved in \(\varDelta \), the stopping index \(k_{stop}\) is saved in batch, and the computation stops. Otherwise, there are no new dominators of s and it is classified in the new front. Figure 1 illustrates the scheme to analyse the dominance of the individual s by a thread block with \(blockDim=4\) for a very reduced population when the 1st and 2nd fronts are computed by GPU-BOS.
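
The core of this block-level dominance check can be sketched as follows. This is a simplified, single-pass reading of Algorithm 5 with our own names and data layout: the real CuNewFront kernel additionally manages the \(\varDelta \) and batch structures so that already-checked candidates are not rescanned.

// Simplified sketch: one thread block per individual s; the threads cooperate to
// decide whether s enters the front `rank`.  pop is N x M (row-major), Q is the
// column-sorted index matrix (Q[k * M + j] = k-th individual of column j) and
// F holds the ranks (-1 = unranked).
__global__ void new_front_sketch(const double *pop, const int *Q,
                                 const int *Brow, const int *Bcol,
                                 int *F, int rank, int N, int M) {
    int s = blockIdx.x;                               // one thread block per individual
    if (s >= N || F[s] != -1) return;                 // already ranked: nothing to do
    __shared__ int dominated;
    if (threadIdx.x == 0) dominated = 0;
    __syncthreads();
    int i_max = Brow[s], j = Bcol[s];
    // Each thread scans a stride of the candidate list S_s = Q[0..i_max-1][j].
    for (int k = threadIdx.x; k < i_max && !dominated; k += blockDim.x) {
        int t = Q[k * M + j];                         // candidate dominator
        if (F[t] != -1 && F[t] < rank) continue;      // ranked in a lower front: cannot block s
        bool dom = true, strict = false;
        for (int m = 0; m < M; ++m) {
            double a = pop[t * M + m], b = pop[s * M + m];
            if (a > b) { dom = false; break; }
            if (a < b) strict = true;
        }
        if (dom && strict) dominated = 1;             // benign race: any thread may set the flag
    }
    __syncthreads();
    if (threadIdx.x == 0 && dominated == 0)
        F[s] = rank;                                  // no blocking dominator: s enters this front
}

In this sketch the host would call the kernel repeatedly with rank = 0, 1, 2, ... until every entry of F is set; avoiding the rescans through the \(\varDelta \) and batch bookkeeping is exactly what the real CuNewFront adds on top of this scheme.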

The performance of GPU-BOS is bounded by the CuNewFront kernel. In this kernel there are two types of comparisons. The first type (lines 8–14, Algorithm 5) is the comparison of an individual with its saved dominators. Although each such comparison runs in \({\mathcal {O}}(1)\), one individual can be compared with the same dominator multiple times if the dominator has not yet been ranked. However, the maximum number of ranks is N and the rank is incremented after each call to the kernel, so there cannot be more than \({\mathcal {O}}(N)\) such comparisons per individual, i.e. \({\mathcal {O}}(N^2)\) in total. The second type (lines 15–31, Algorithm 5) is the comparison of an individual with its potential dominators. There are \({\mathcal {O}}(N^2)\) combinations and each comparison runs in \({\mathcal {O}}(M)\).
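
Putting both bounds together, the work per individual is at most

$$\begin{aligned} \underbrace{{\mathcal {O}}(N)\cdot {\mathcal {O}}(1)}_{\text {saved dominators}} + \underbrace{{\mathcal {O}}(N)\cdot {\mathcal {O}}(M)}_{\text {potential dominators}} = {\mathcal {O}}(MN), \end{aligned}$$

and summing over the N individuals gives \({\mathcal {O}}(MN^2)\).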

Therefore, GPU-BOS retains the worst-case complexity of the original BOS algorithm: \({\mathcal {O}}(MN^2)\). Although the computational schemes of GPU-BOS and BOS are different, they are based on the same rules to organize the individuals in order to reduce the number of dominance comparisons.

[Algorithm 5]
Fig. 1

Dominance checking to analyse if the individual s belongs to the 1st or 2nd front and data structures involved in CuNewFront. a On the left, the dominance of the sorted list \(S_s=\{f,a,c,e,b,d,g,m,l,h,n\}\) over s is computed to check if the individual s belongs to the first front. The checking is computed in a parallel loop by a thread block. At first iteration, c is identified as a dominator and the s-block stops. Then, the threads write in \(\varDelta _s\) the last dominators computed, c, and also they write 4 in the array batch, the next index to continue the dominance checking for a new call of CuNewFront. These written values are shown on the right of the figure. b On the right, to check if the individual s belongs to the 2nd front, when the last dominators of s computed, saved in \(\varDelta _s\), have no rank then s-block stops. But in the example the last dominator computed, c, has been previously classified in the first front. Then, the dominance over the sorted list \(S_s^*=\{b,d,m,l,n\}\) is computed. Notice that, the individuals g and h have been previously classified in the first front, then they are not included in the checking list. If new dominators of s are not identified, then s is classified in the 2nd front. Otherwise, the individual s is not ranked and it will be involved in the successive calls of CuNewFront to compute new fronts

Table 1 Characteristics of the test GPUs
Fig. 2

Experimental Runtime (solid lines) and Energy Consumption (dashed lines) of BOS, MC-BOS and GPU-BOS on the platform \({\mathcal {F}}_1\) for four random populations of four sizes when the number of objectives changes from \(M=5\) to \(M=30\)

Fig. 3

Experimental Runtime (solid lines) and Energy Consumption (dashed lines) of BOS, MC-BOS and GPU-BOS on the platform \({\mathcal {F}}_2\) for four random populations of four sizes when the number of objectives changes from \(M=5\) to \(M=30\)

4 Evaluation

This section opens with a technical description of the experimental hardware setup, followed by an analysis of the performance and energy consumption of the BOS, MC-BOS and GPU-BOS algorithms when applied to compute the fronts of populations with different sizes and numbers of objectives on the target architectures. We assume that users of these routines are interested in specific EMO problems with a particular number of objectives and population size. Our goal with this experimental analysis is to provide general criteria to help users choose the best platform/version to obtain the best performance and/or lowest energy consumption for solving specific NDS processes.

Three computational architectures have been considered in the experiments:

\({\mathcal {F}}_1\): Bullx R424-E3: 2 Intel Xeon E5 2650 processors with 8 cores each and 64 GB of RAM. It is connected to a NVIDIA Tesla M2070 GPU. Table 1 provides technical details about this GPU.

\({\mathcal {F}}_2\): Bullx R421-E4: 2 Intel Xeon E5 2620v2 processors with 6 cores each and 64 GB of RAM. It is connected to 2 NVIDIA K80 boards (each NVIDIA K80 is composed of two Kepler GK210 GPUs). The characteristics of each NVIDIA K80 are given in Table 1.

\({\mathcal {F}}_3\): Bullion S8: 8 Intel Xeon E7 8860v3 processors with 16 cores each and 2.3 TB of RAM.

The test platforms do not include the most recent multicore processors and GPUs. However, this hardware is readily available on many currently accessible clusters for scientific computation. Therefore, these platforms can be considered representative examples of the hardware available to users of the routines evaluated here. Three kinds of multicore processors are considered.

\({\mathcal {F}}_1\) contains 2 Intel Sandy Bridge EP processors for a total of 16 CPU cores with 64 GB of RAM and 2 NVIDIA Tesla M2070 GPUs of the Fermi microarchitecture. \({\mathcal {F}}_2\) is a newer platform, containing 2 Intel Ivy Bridge EP processors with 64 GB of RAM for a total of 12 CPU cores and 2 NVIDIA Tesla K80 GPUs of the Kepler microarchitecture. Although both of these platforms have multiple GPUs, our GPU-BOS implementation uses only one. \({\mathcal {F}}_3\) contains 8 Intel Haswell processors for a total of 128 cores with 2.3 TB of RAM. This platform is composed of four nodes, with two sockets and 576 GB of RAM per node, interconnected by a proprietary bus that converts them into a single NUMA node.

To provide a fair comparison and avoid the overhead of the JVM, the original sequential BOS Java codeFootnote 5 has been reimplemented in C. All the programs have been compiled using gcc 5.4.0 and nvcc 8.0.44 with optimization flags O3. All platforms run Ubuntu 16.04 LTS with CUDA SDK 8.

For the acquisition of the energy consumption data, we have developed a software tool that collects metrics from hardware counters integrated on each platform. It uses the Running Average Power Limit (RAPL) interface of Intel processors, introduced with the Sandy Bridge microarchitecture [3], through the Linux powercap sysfs interface, and the NVIDIA Management LibraryFootnote 6 (NVML) API on NVIDIA GPUs.
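
As an illustration of how such counters can be read (a minimal sketch, not the actual measurement tool; the package/domain path, the device index and the polling strategy are simplified assumptions), the CPU energy can be sampled from the powercap sysfs files and the GPU power from NVML:

#include <nvml.h>
#include <cstdio>

// Cumulative CPU package energy (microjoules) exposed by RAPL through the
// powercap sysfs interface (package 0 shown; reading may require permissions).
static long long read_rapl_uj(void) {
    long long uj = -1;
    FILE *f = std::fopen("/sys/class/powercap/intel-rapl:0/energy_uj", "r");
    if (f) { std::fscanf(f, "%lld", &uj); std::fclose(f); }
    return uj;
}

// Instantaneous board power (milliwatts) of GPU 0 through NVML.  A real tool
// would initialize NVML once and poll this value periodically.
static unsigned int read_gpu_power_mw(void) {
    unsigned int mw = 0;
    nvmlDevice_t dev;
    if (nvmlInit() == NVML_SUCCESS &&
        nvmlDeviceGetHandleByIndex(0, &dev) == NVML_SUCCESS)
        nvmlDeviceGetPowerUsage(dev, &mw);
    nvmlShutdown();
    return mw;
}

Sampling at a fixed period and integrating the power over the measured runtime yields an energy estimate for the GPU domain, while the RAPL counter directly accumulates energy for the CPU package.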

The runtime and energy of the sequential BOS, MC-BOS and GPU-BOS have been evaluated when the number of objectives, M, and the population size, N, vary on the platforms \({\mathcal {F}}_1\) and \({\mathcal {F}}_2\). On \({\mathcal {F}}_3\), only results for BOS and MC-BOS are shown since this platform has no GPU. Figures 2, 3 and 4 present the experimental results with four plots for random populations with \(N = 5000, 20{,}000, 50{,}000, 100{,}000\) individuals. To obtain these results, 100 test populations have been randomly generated by the testing scripts, following a uniform distribution in the range [0, 1). For each population and implementation, the experiment was repeated 10 times; in each case the best and worst runtimes were discarded and the rest averaged to obtain the experimental results shown in this section.

Every plot represents, on the left axis, the runtimes in milliseconds (solid lines) and, on the right axis, the energy consumption in joules (dashed lines) for the different NDS versions. The energy measurements take into consideration only the power domains used by each implementation. This means that, for the sequential algorithm, we only consider the processor where it is running; for the multicore algorithm, we consider as many processors as needed to allocate the number of threads spawned; and for the GPU algorithm, we consider the processor where the host code is running and the GPU where the kernels are launched.

Fig. 4

Experimental Runtime (solid lines) and Energy Consumption (dashed lines) of BOS and MC-BOS on the platform \({\mathcal {F}}_3\) for four random populations of different sizes when the number of objectives changes from \(M=5\) to \(M=30\)

Fig. 5

Experimental acceleration factors of runtime and energy savings for the multicore version (on the top) and the GPU version (at the bottom) in relation to the sequential BOS on the platform \({\mathcal {F}}_1\) for four random populations of four sizes when the number of objectives changes from \(M=5\) to \(M=30\)

Fig. 6

Experimental acceleration factors of runtime and energy savings for the multicore version (on the top) and the GPU version (at the bottom) in relation to the sequential BOS on the platform \({\mathcal {F}}_2\) for four random populations of four sizes when the number of objectives changes from \(M=5\) to \(M=30\)

The general trends of BOS, MC-BOS and GPU-BOS as the number of objectives increases are similar in every plot, although the random populations are different on each platform. The runtime of BOS increases almost linearly for the smallest populations with \(M = 5, 10, 15\); for \(M = 20, 25, 30\) the runtime increases only slightly for small populations, and for the larger populations it even decreases, since the percentage of non-dominated individuals grows with N and M, which reduces the number of comparisons needed to compute the fronts. MC-BOS achieves a high acceleration with respect to BOS in terms of both runtime and energy. MC-BOS runtimes increase with M, and the slope of this increase becomes less pronounced as the number of individuals grows. This trend is clearly appreciated on the three platforms in general terms. On \({\mathcal {F}}_1\) (\({\mathcal {F}}_2\)) there are relevant runtime increments between \(M=15\) and 20 (\(M=10\) and 15) because the number of cores available on the platform \({\mathcal {F}}_1\) (\({\mathcal {F}}_2\)) becomes smaller than M, that is, \(M > 16\) (\(M > 12\)). Since the number of threads is defined by M, as M increases several threads have to be executed concurrently on the same core. The runtimes of GPU-BOS decrease as M increases for the smaller populations, and the decrease is more pronounced when the population is larger.

On platform \({\mathcal {F}}_1\), GPU-BOS is slower than MC-BOS when \(M \le 15\) and \(N < 10{,}000\). For \(10{,}000 \le N < 100{,}000\) and \(M > 15\), GPU-BOS is faster than MC-BOS and, in some cases, more energy efficient. For populations larger than that, the multicore implementation scales better than the GPU implementation. The high energy consumption of these old NVIDIA Fermi cards is relevant, making GPU-BOS less energy efficient than even the sequential implementation for \(M = 5\).

On platform \({\mathcal {F}}_2\), the newer NVIDIA Kepler cards show better performance than their Fermi counterparts, being faster than the multicore implementation when \(M \ge 15\) for all the population sizes shown. They are also more energy efficient, making the GPU implementation the best one in both metrics for most of the test cases.

On platform \({\mathcal {F}}_3\), there are enough cores for the number of threads spawned, so the situation where multiple threads execute concurrently on the same core does not arise. Consequently, the growth of the runtime as M increases is much more gradual.

Additionally, experiments on NDS were performed in the context of the NSGA-II algorithm: populations obtained after several NSGA-II generations on the DTLZ2 test functions [13] with \(M = 5, \dots , 30\) have also been analysed. This experimental study has not been included in this section because the acceleration factors achieved by MC-BOS and GPU-BOS, in terms of performance and energy, are similar to those obtained for random populations. Therefore, the general conclusions of the study are the same.

Fig. 7

Experimental acceleration factors of runtime and energy savings for the multicore version in relation to the sequential BOS on the platform \({\mathcal {F}}_3\) for four random populations of four sizes when the number of objectives changes from \(M=5\) to \(M=30\)

The advantages in performance and energy consumption of both parallel versions with respect to the sequential BOS are analysed in Figs. 5, 6 and 7 for the three test platforms, respectively. These plots have been obtained from the runtimes and energy consumptions shown above in Figs. 2, 3 and 4. The acceleration factors of MC-BOS vs BOS (GPU-BOS vs BOS) range from \(3\times \) to \(14\times \) (\(2\times \) to \(14.5\times \)) on platform \({\mathcal {F}}_1\), from \(2\times \) to \(8\times \) (\(2\times \) to \(17.5\times \)) on platform \({\mathcal {F}}_2\) and from \(3\times \) to \(15.6\times \) on platform \({\mathcal {F}}_3\). The energy saving factors are slightly lower: from \(1.5\times \) to \(5\times \) (\(0.5\times \) to \(2.8\times \)) on platform \({\mathcal {F}}_1\), from \(1\times \) to \(3\times \) (\(0.5\times \) to \(6\times \)) on platform \({\mathcal {F}}_2\) and from \(2\times \) to \(8\times \) on platform \({\mathcal {F}}_3\).

Summarizing, the parallel versions are faster and consume less energy than the sequential BOS by significant margins, in spite of the irregularity of both parallel algorithms. If the number of objectives is smaller than the number of cores in the processor, then the multicore version is the best option to optimize performance and energy. If there are many objectives, then the GPU version is the optimal choice.

5 Conclusions

This work has proposed two parallel procedures, MC-BOS and GPU-BOS, to improve the performance and the energy consumption of Non-Dominated Sorting on multicore and GPU architectures. Both procedures are based on the principles of the Best Order Sort algorithm, and their source code is publicly available on GitHubFootnote 7. BOS is a state-of-the-art procedure that efficiently ranks populations by avoiding unnecessary comparisons, at the cost of a sequential structure. To improve the performance of BOS by exploiting modern multicore processors and GPUs, two schemes have been developed to increase the parallelism level of this algorithm. MC-BOS defines the same number of threads as objectives and efficiently exploits multicore processors. GPU-BOS defines a scheme that reduces the number of comparisons needed for the sorting without requiring large dominance data structures, thereby reducing both the comparisons and the memory requirements with respect to other GPU versions of NDS.

From the evaluation results, it can be concluded that both parallel algorithms improve the performance of the BOS algorithm by factors of up to \(17.5\times \) and reduce the energy consumption by factors of up to \(8\times \). The multicore version is the best option to optimize performance and energy consumption when the number of objectives is not excessive; otherwise, the best option is the GPU version. These results are remarkable given the irregularity of NDS procedures that optimize the sequential performance, such as BOS: they are inherently sequential, and a redefinition of the schemes of the parallel algorithms has been necessary to exploit the moderate and massive parallelism of multicore processors and GPUs.

Our future work is focused on the evaluation of MC-BOS and GPU-BOS on novel architectures, such as Intel Skylake processors and NVIDIA Volta GPUs.