1 Introduction

Currently, real-time monitoring and control systems based on Multi-Objective Optimization (MOO) are actively developed and applied in different areas. Recently, several real-time traffic signal control adaptive systems that are able to totally adapt the signal timing to the traffic situation based on multi-objective optimization have been developed [3, 12, 20, 24]. An Evolutionary Multi-Objective (EMO) algorithm was applied for operation optimization in a real-time water supply system [32]. In [2], a controller based on a EMO algorithm was used in simulation optimization methodology to solve a real-time multi-objective dispatching decision problem. In [26], the NSGA-II [6] algorithm was applied in microcontroller-based polarization control system. In [10], a microcontroller-based on Artificial Neural Network and the NSGA-II algorithm was developed. It could be applied for different fast real-time intelligent control applications with a non-linear model predictive strategies. There are still many more real-time systems in which optimization problems are solved by EMO algorithms [5, 25]. Summarizing, MOO real-time systems are of great interest and the integration of EMO algorithms enables to provide fast enough good solutions. Determination of the Pareto front is the main goal of MOO. However, it is impossible for some problems to identify the exact Pareto front due to reasons such as continuity of the front, nonexistence of analytical expression or complexity of the problem being solved. On the other hand, in real-world applications it is usually not necessary to find the whole Pareto front, but rather its approximation.

Some well-known EMO algorithms to approximate the Pareto front are NSGA-II [6], PAES [21], MOAE/D [39], IBEA [41], SPEA2 [42], etc. Several phases can be identified in most of them: evaluation of an objective function, Pareto dominance ranking (Non-Dominated Sorting) and genetic operations. The most computationally expensive phase is the Pareto dominance ranking. Examples of EMO approaches based on Pareto dominance ranking are PESA-II [4], NSGA-II [6], R-NSGA-II [7], Synchronous R-NSGA-II [15], etc. In this work, NSGA-II is taken as an example of EMO algorithm to be adapted to low-power platforms and evaluated on them.

Currently, embedded systems for real-time computing combine low-consumption and heterogeneous systems (CPU and GPU). Representative examples of this kind of platforms are the NVIDIA Jetson platforms, widely used in real-time systems.Footnote 1 In the context of real time, the use of NVIDIA Jetson platforms can be useful to solve MOO problems using EMO algorithms for two main reasons: (1) the low-energy consumption because of the specific technology of this kind of platforms; (2) the exploitation of the multicore and the GPU of the NVIDIA Jetson allows the acceleration of programs. For this aim, OpenMP and CUDA can be used as interfaces to develop parallel programs.

Some parallel versions of NSGA-II have been developed: in [9, 11, 13] parallel versions of NSGA-II algorithm based on master–slave paradigm are presented, where population is distributed among the workers to speed up the process of functions evaluation; in [22] several parallel strategies where the Pareto ranking is parallelized in NSGA-II are proposed, and are experimentally investigated when solving the competitive facility location problem in [23]; in [40] a parallel version with individual migration of NSGA-II is investigated; in [38] an EMO parallel algorithm for GPU was developed and tested on several two- and three-objective benchmark problems in [37]; a multi-objective version of Differential Evolution was parallelized on GPU in [33]. Usually, the cost of the evaluation of the objective function in EMO approaches is not very high in the sense of computation time. Hence, most of the computation time is used by checking the Pareto dominance ranking, which is implemented in the Fast Non-Dominated Sorting (FNDS) procedure [6]. The complexity of the FNDS procedure is \(O(MN^2)\), where M is the number of objectives and N the total size of population. As FNDS consumes most of the NSGA-II runtime, the acceleration of the FNDS procedure is mandatory to speed up NSGA-II, it has been justified in [35]. The reduction of the complexity order of the FNDS has been a focus of interest for researchers [14, 19, 27, 34, 36]. Some improvements were implemented by developing more efficient sorting strategies, however, computational burden of the FNDS procedure has a complexity \(O(MN^2)\) in the worst case for all the approaches. Thus, parallel strategies should be considered to accelerate the computation of the procedure.

Recently, a novel NSGA-II parallel implementation on a GPU, focusing on the FNDS procedure and achieving promising speedups has been proposed in [16]. The NDS version of complexity \(O(MN^3)\) is accelerated on GPU and every thread computes the dominance of every individual without writing in auxiliary structures. Therefore, it is a GPU parallel NDS version with redundant dominance comparisons (hereinafter this version is referred to as Gupta-NDS). Moreover, an efficient parallel version of the FNDS procedure has been formally presented in [35], but its experimental analysis is very limited. Both works are very related to our proposal, since our objective is the acceleration of the Pareto dominance ranking on the integrated GPU of the embedded systems. However, our proposal, referred to as Efficient Fast NDS (EFNDS), defines a new data structure to store the dominance information which efficiently allows to compute the individuals that dominate the population by using shuffled reductions on modern GPUs. Additionally, another GPU version of NDS based on Gupta-NDS and similar to the proposal [16] has been also developed. The performance of both implementations on Jetson has been evaluated. Comparative results show that the performance on GPU of EFNDS overcomes Gupta-NDS when the number of fronts and the memory bandwidth are high.

Bearing in mind the limited computational resources on the Jetson platform and our proposal to efficiently compute the Pareto dominance ranking (EFNDS), it is relevant to identify the scale of the problem that can be solved with NSGA-II algorithm on this kind of platforms.

The main contributions of this work are: (1) a novel proposal (EFNDS) to compute the Pareto dominance ranking on the GPU with a better performance than others proposed in the literature. EFNDS is very appropriate for embedded systems because it only consumes energy to compute useful computation with memory requirements of medium size which are supplied by these kind of systems. Results show that this proposal achieves better performance than Gupta-NDS when several fronts are computed; (2) the estimation of the problem sizes that can be solved on a Jetson platform as a prototype of modern embedded system. The rest of this paper is organized as follows. In Sect. 2, the definition of the multi-objective problem is provided. Section 3 describes the Efficient Fast Non-Dominated Sorting procedure (EFNDS) as well as its parallel implementations. An experimental evaluation of the parallel implementations on a Jetson platform is discussed in Sect. 4. Finally, Sect. 5 shows the conclusions of this work.

2 Description of the problem

A multi-objective minimization problem is formulated as follows [28]:

$$\begin{aligned} \min \limits _\mathbf{x \in \mathbf S } \mathbf f (\mathbf x ) = [f_1(\mathbf x ), f_2(\mathbf x ), \ldots , f_M(\mathbf x )]^T \end{aligned}$$
(1)

where \( \mathbf z =\mathbf f (\mathbf x ) \) is an objective vector, defining the values for all objective functions \(f_1(\mathbf x ), f_2(\mathbf x ), \ldots , f_M(\mathbf x ), f_i\) : \(\mathbb {R}^V\rightarrow \mathbb {R}, i \in \{1, 2, \ldots , M\}\), where \(M\ge 2\) is the number of objective functions; \(\mathbf x =(x_1, x_2, \ldots , x_V)\) is a vector of variables (decision vector) and V is the number of variables \(\mathbf S \subset \mathbb {R}^V\) is search space, which defines all feasible decision vectors.

A decision vector \(\mathbf x ' \in \mathbf S \) is a Pareto optimal solution if \(f_i(\mathbf x ') \leqslant f_i(\mathbf x )\) \(i=1, \ldots ,M\) for all \(\mathbf x \in \mathbf S \) and \(f_j(\mathbf x ')<f_j (\mathbf x )\) for at least one j. Objective vectors are defined as optimal if none of their elements can be improved without worsening at least one of the other elements. An objective vector \(\mathbf f (\mathbf x ')\) is Pareto optimal if the corresponding decision vector \(\mathbf x '\) is Pareto optimal. The set of all the Pareto optimal decision vectors is called the Pareto set. The region defined by all the objective function values for the Pareto set points is called the Pareto front.

For two objective vectors \(\mathbf z \) and \(\mathbf z ', \mathbf z '\) dominates \(\mathbf z \) (or \(z' \succ z\)) if \(z_i' \leqslant z_i\) for all \(i=1, \ldots , M\) and there exists at most one j such that \(z_j' < z_j\). In EMO algorithms, the subset of solutions in a population whose objective vectors are not dominated by any other objective vector is called the non-dominated set, and the objective vectors are called the non-dominated objective vectors. The main aim of the EMO algorithms is to generate well-distributed non-dominated objective vectors as close as possible to the Pareto front.

NSGA-II [6] is the most widely used and well-known EMO algorithm for approximating the Pareto front that is based on NDS. It has been considered to analyse the energy efficiency of EMO algorithms when different number of CPU cores and/or GPU cards are exploited. The steps of NSGA-II are described in Algorithm 1.

figure a

Step 2 and Step 5 of the Algorithm 1 are devoted to computing the NDS procedure, which is the most computationally expensive task in the NSGA-II.

3 Efficient Fast Non-Dominated Sort

The procedure referred to as ‘Fast Non-Dominated Sort’ was proposed by Deb et al. in [6] to compute the Steps 2 and 5 of Algorithm 1. In such proposal, the dominance information of every individual consists of: the number of dominator individuals, the number and the list of dominated individuals. Then, the fronts are computed from this dominance information. The complexity order of this process is \(O(MN^2)\) and it requires a storage of \(O(N^2)\). Previous proposals of NDS have a high complexity \(O(MN^3)\) and a high number of redundant comparisons between pairs of individuals, but as a counterpart their storage requirements grows up as O(N). As mentioned in Sect. 1, there are several proposals to define efficient sorting strategies which reduce the redundant comparisons, with a computational burden of complexity \(O(MN^2)\) in the worst case. The redundancy level of NDS increases as the number of fronts increases and can be an efficient approach if the population is classified in few fronts. So, two kinds of approaches to compute NDS can be distinguished: FNDS with a previous dominance computation which has few redundant comparisons and an intensive memory use; NDS with redundant comparisons to evaluate the dominance without additional memory requirements.

In this work, a new approach to compute and store the dominance information is proposed. The corresponding FNDS procedure is described in Algorithm 2 and referred to as Efficient FNDS (EFNDS). It selects N individuals as ‘Elite population’ from the initial population \(P_0\) with 2N individuals (step 5 of Algorithm 1). Algorithm 2 includes two phases. The Phase 1 computes the dominance matrix, D, of dimensions \(2N\times 2N\), for the initial population and the Phase 2 selects the individuals of the first fronts until half of them are selected in the ‘elite’ set. The structure of the Phase 2 is well known. Our proposal is based on the previous computation of a dominance matrix D whose elements, \(D_{i,j}\), store if the individual \(P_i\) is dominated by \(P_j\); that is, \(D_{i,j}=1\) if \(P_j\) dominates \(P_i\) and \(D_{i,j}=0\) in other case.

This way, the idea for checking if \(P_i\) is dominated by the population P (line 14 of Algorithm 2) is based on the computation of the number of individuals of P which dominate \(P_i\) as \(d_i=|\{ D_{i,j}=1; \, \forall {P_j} \in P \}|\), so if \(d_i=0\) then \(P_i\) is not dominated by P and it belongs to the new front. Therefore, the computation of \(d_i\) can be carried out reading the corresponding values \(D_{i,j}\) on the ith row of D.

figure b

3.1 EFNDS on GPU

Compute Unified Device Architecture (CUDA) is the parallel interface introduced by NVIDIA to help develop parallel codes using C or C++ language. CUDA provides some abstraction to the GPU hardware, and it provides the SIMT (Single Instruction, Multiple Threads) programming model to exploit the GPU [31]. However, the programmer has to take into account several features of the architecture, such as the topology of the multiprocessors and the management of the memory hierarchy. For the execution of the program, the CPU (called host in CUDA) performs a succession of parallel routine (kernels) invocations to the device. The input/output data to/from the GPU kernels are communicated between the CPU and the ‘global’ GPU memories. Integrated GPUs (such as the GPUs in Tegra processors) share the low levels of cache memory with the CPU and it is not necessarily the replication of the input–output data of the GPU [30]. GPUs have hundreds of cores which can collectively run thousands of computing threads. Each core, called Scalar Processor (SP), belongs to a set of multiprocessor units called Streaming Multiprocessors (SM). The SMs are composed of 192 (or 128) SPs on Kepler (or Maxwell) GPU architectures [17, 29, 31]. This way, the GPU device consists of a set of SMs and each kernel is executed as a batch of threads organized as a grid of thread blocks [1].

figure c

It is relevant to underline that the computation of dominance consumes most of the runtime of FNDS procedures when it is executed in sequence [16, 35]. Our approach to compute D with a high level of parallelism, can efficiently compute the dominance information on the GPU architecture. Moreover, the fronts computation can also be accelerated on GPU by the use of the fast shuffled reductions of CUDA. This way, our parallel EFNDS version efficiently computes D and the corresponding fronts on GPU.

Algorithm 3 shows the host pseudocode to compute EFNDS on GPU. It is based on two kernels: CuDominance (line 2) which computes the matrix D and CuFronts (line 7) which computes one front from the dominance information stored in D. CuDominance is executed once and CuFronts is iteratively computed until a Elite population of N individuals is classified. To fit the classified population to N individuals, a subset of last front is selected according to the crowding distance on CPU.

Algorithm 4 shows the procedure to compute the dominance matrix on GPU, CuDominance. The input is the population of 2N individuals. This way, \(4N^2\) threads are activated to check concurrently the dominance between all pairs of individuals and update the elements of D without any synchronization among them.

figure d

When CuDominance finishes, the matrix D is on memory and the fronts can be computed by successive executions of CuFronts. Algorithm 5 describes the procedure to compute one front on GPU. One thread block is activated for every individual. Therefore, the threads of every block compute the reduction of a particular row of D and a fast reduction scheme based on shuffled functions is applied.Footnote 2 The shuffled functions enable a thread to directly read a register from another thread in the same warp; therefore, threads can compute fast reductions avoiding memory accesses.

It is noteworthy that the output of CuFronts is written in a vector F of dimension 2N, where every element \(F_i\) stores the front number for the individual i. So, when CuFronts is executed, a new selection of individuals is classified in the new front. Additionally, F defines the set of individuals in the population P since if \(F_i=0\) then the individual i has not classified and it is a potential individual of posterior fronts, i.e. \(i \in P\). Posterior fronts classifications will be computed only for the individuals with \(F_i=0\). Therefore, F also defines the population P to compute a new front. Figure 1 illustrates the computation of the fourth front on the GPU. It shows the data structures and their threads mapping involved in the computation of a front.

figure e
Fig. 1
figure 1

Computation of the fourth front from D matrix when CuFronts kernel is executed on GPU

Next section evaluates the improvement in the performance of NSGA-II on a Jetson platform when the EFNDS is computed on the GPU integrated in the Tegra processor.

4 Experimental evaluation

In this section, the performance of two parallel GPU versions of NSGA-II has been evaluated on the Jetson platform. These parallel versions are based on: (1) the EFNDS algorithm described above; (2) the Gupta-NDS version developed in [16]. Additionally, the sequential NSGA-II version 1.1.6 of K. DebFootnote 3 has also been evaluated.

DTLZ2 test problem has been considered in the evaluation because the number of objective functions can be defined by the user [8, 18]. DTLZ2 problem was specially designed for evaluating multi-objective algorithms in this context. Moreover, it can be configured to have high computational demands by establishing a large number of objectives and variables. The formulation of the problem is as follows:

$$\begin{aligned}&\min {f_1}(\mathbf x ) = (1 + g(x)) \prod _{i=1}^{v-1} \mathrm{cos}\left( \frac{x_i\pi }{2}\right) \nonumber \\&\min {f_2}(\mathbf x ) = (1 + g(x))\mathrm{sin}\left( \frac{x_{v-1}\pi }{2}\right) \prod _{i=1}^{v-2} \mathrm{cos}\left( \frac{x_i\pi }{2}\right) \nonumber \\&\cdots \\&\min {f_l}(\mathbf x ) = (1 + g(x))\mathrm{sin}\left( \frac{x_{v-l+1}\pi }{2}\right) \prod _{i=1}^{v-l} \mathrm{cos}\left( \frac{x_i\pi }{2}\right) \nonumber \\&\cdots \nonumber \\&\min {f_v}(\mathbf x ) = (1 + g(x))\mathrm{sin}\left( \frac{x_1\pi }{2}\right) \nonumber \end{aligned}$$
(2)

where \(g(\mathbf x ) = \sum _{i=v}^{n} (x_i-0.5)^2 , x_i \in [0, 1] \). The number of variables n is selected according to the equation \(n = v + k - 1\), with a suggested value of \(k=10\).

Other test problems have been evaluated obtaining similar results (DTLZ5 and DTLZ7). However, for the sake of clarity they have not been included in this study.

The Jetson platform has been selected as a prototype of a low-power platform because: (1) its Tegra processor contains a multicore-CPU and a GPU and they can support the parallel programming interfaces as extended as CUDA or OpenMP; (2) the clock frequencies of cores and memory buses of the GPU and the CPU of the Jetson can be modified to control the energy consumption.

The hardware setup we use is based on a NVIDIA Jetson TK1 development board,Footnote 4 embedding a Tegra K1 SoC (System on a Chip) processor with 2 GB of DDR3L RAM.

On the one hand, the Tegra K1 includes an NVIDIA GPU with 192 CUDA Kepler cores and an ARM quad-core Cortex-A15 variant of low-power architecture at 2.32 GHz. The board runs an extended version of Linux Ubuntu 14.04.4 LTS adapted for the ARM architecture. The codes for the ARM are compiled using gcc v.4.8.4, while the codes developed for the NVIDIA GPU are compiled with nvcc v.6.5.12.

The memory requirements of the EMO problems depend on three parameters: the size of the population, the number of variables and the number of objective functions. These parameters have to be defined bearing in mind that the Jetson board has only 2 GB of physical memory (shared by the ARM CPU and the CUDA GPU). Therefore, we have considered instances of the NSGA-II problem which fit on the Jetson platform.

As mentioned above , the Tegra processors allow us to control the clock frequencies of the GPU, CPU and memory controller, that is, the efficiency of cores and/or memory of the TK1 board can be configured. Therefore, NSGA-II based on EFNDS and Gupta-NDS can be evaluated with different frequency configurations. The analysis of these results can determine the most used resources by every NSGA-II version.

Fig. 2
figure 2

Runtimes of the three versions of NSGA-II studied for DTLZ2 with \(M=5, N=10{,}000\) and varying frequencies of the GPU core clock (GBUS) and the External Memory Controller clock (EMC)

As shown in Fig. 2, our implementation is dramatically influenced by the memory controller clock, while the performance of the Gupta version does not differ much. However, when we change the frequency of the GPU core clock, we can observe that our implementation keeps its performance, while Gupta is strongly penalized. This is consistent with our theoretical considerations, as Gupta’s implementation does not store any dominant information in memory, while EFNDS heavily uses memory to reduce the number of redundant comparisons needed to evaluate the dominance between the individuals. Notice that the CPU performance is also affected by the EMC frequency. It proves that the memory bus is shared by CPU and GPU devices.

Table 1 Runtimes of NSGA-II based on: sequential version of NDS (DebSeq), parallel EFNDS, parallel Gupta-NDS and adaptative version (with its acceleration factor AF*)

Table 1 shows the runtimes achieved by the NSGA-II versions based on sequence of K: Deb, parallel EFNDS, parallel Gupta-NDS and the parallel version which combines both. The results also show that there is no optimal implementation for all problems.

To always choose the best NDS implementation, we have also developed a hybrid version that switches between both implementations. As can be observed in the last column of Table 1, this strategy has remarkable results. This behaviour is shown in Fig. 3, where we plot the time spent on each generation for each strategy. As we can observe, EFNDS with many objectives and many fronts is faster while Gupta is better when there is only one front. We have concluded that this is a situation that commonly appears on the last generations of each execution, so we propose an adaptative approach which chooses the best one in each case.

It is assumed that the starting number of fronts is high. So, in both cases, the adaptative approach selects FNDS at the first generation. Then, it checks if the Gupta version can improve the performance when the number of fronts decreases. For DTLZ2 and \(M = 5\), this happens in the last generations, so it switches from EFNDS to Gupta. For \(M = 15\), EFNDS is always faster than Gupta even though most generations only compute one front, so the adaptative approach always selects FNDS.

Finally, the performance of all parallel GPU versions are improved over the sequential CPU version, with respectable acceleration factors, as shown in the last column of Table 1.

Fig. 3
figure 3

Time evolution on each NSGA-2 iteration for the test problem DTLZ2 with \(M = 5,15\) and \(N = 10{,}000,20{,}000\)

5 Conclusions and future works

In this work, a new parallel proposal to accelerate the NSGA-II for solving multi-objective problems has been proposed. It has been evaluated in comparison with the recent proposal [16] on a Jetson TK1 as a prototype of modern embedded system of low power. The results have shown that EFNDS achieves better performance than Gupta when several fronts are computed and/or there is a high number of objectives. We also propose an adaptative version that switches between both of them, improving the performance of NSGA-II in all test cases.

Furthermore, we show that the Jetson TK1 embedded platform is adequate to accelerate EMO algorithms for large numbers of objectives and populations, achieving respectable performance and high acceleration factors.

As future works, we are planning to evaluate the energy efficiency of the adaptative approach on this platform. We are also testing the performance of our implementations on other GPU platforms.