1 Introduction

Genetic Programming has been widely adopted by the Search Based Software Engineering community: its applications range from fault localisation [7, 12, 14] and Genetic Improvement [5, 9] to program repair [3, 8]. Improving its efficiency and scalability would have a far-reaching impact across these application domains.

Parallelisation is one of the most promising techniques for scalability. Population-based evolutionary computation has been described as ‘embarrassingly parallel’, because the fitness evaluation of each individual solution in the population is often completely independent of the others and, consequently, can be performed in parallel. This is particularly the case with Genetic Algorithms (GAs): GAs apply the same fitness function(s) to the entire population, whose individuals are essentially input data to those function(s).

General Purpose Computing on Graphics Processor Units (GPGPU) exploits the Single Instruction Multiple Data (SIMD) architecture of graphics shaders to parallelise computation [4]. The SIMD architecture fits the parallel fitness evaluation of GAs naturally, and has provided significant speed-ups for search-based test suite minimisation [13].

Genetic Programming (GP), on the other hand, maintains a population of programs. Parallelisation at the GP population level does not fit the SIMD architecture, because each individual is a different program rather than different data. Instead, GP can be parallelised at the training data level: a single candidate GP solution usually has to be evaluated against many data points in the training set, and these evaluations can be performed in parallel.

However, this GPGPU-based fitness evaluation for GP requires the conversion of GP trees into GPGPU-executable kernels. The conversion involves kernel compilation, a time-consuming process that is external to the GP system. The cost of kernel compilation raises a cost-benefit trade-off for GPGPU.

This paper evaluates different methods of amortising the cost of kernel compilation using the CUDA toolkit. The kernel per individual method converts each individual GP tree into a separate CUDA kernel. The kernel per generation method aggregates all individuals in the population and compiles them with a single compiler invocation. Finally, the kernel interpreter method uses an expression interpreter: after a single compilation of the interpreter, GP can evaluate any GP tree without any further compilation. We use CPU-based GP as the baseline, and the Dow Chemical symbolic regression data as the benchmark problem [10].

2 Evaluating GP Trees Using GPGPU

To achieve data-level parallelisation using GPGPU, kernels must be generated dynamically, because the programs to be evaluated change in every generation. Here, we introduce three different approaches.

  • Kernel per Individual: The most intuitive approach to converting GP trees into CUDA kernels is to generate a single CUDA kernel for each candidate solution. Since the only difference between candidates is the expression they represent, kernel source code can be generated using templates: we only need to convert GP trees into infix expressions that conform to CUDA kernel syntax (a minimal sketch of this template-based generation appears after this list). While intuitive, this approach has the drawback that the CUDA compiler must be invoked once per individual. If the population size is large, this may cause a significant overhead.

  • Kernel per Generation: To reduce the kernel compilation overhead, we can generate one kernel source code file per generation: the single file contains multiple kernels, each corresponding to an individual solution in the GP population. While this results in much longer kernel source files (and hence increased compilation time per invocation), we expect to save the overhead of invoking the CUDA compiler multiple times.

  • Kernel Interpreter: One technique that has been studied in the GP literature [1, 11] is to use a single kernel that interprets GP candidate solutions. Using an interpreter, the GP population is transferred to the GPU as data, which is then interpreted and evaluated against the data points in the training set. The kernel interpreter method requires only one compilation throughout the entire GP run. While this significantly reduces the kernel compilation time, it increases the complexity of the CUDA kernel, which in turn affects GPGPU performance. We implemented a Reverse Polish Notation (RPN) based CUDA interpreter kernel for this study (a sketch of such an interpreter also appears after this list).
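To make the template approach concrete, the following sketch shows one way to generate and compile per-individual and per-generation kernels from Python. It is a minimal illustration, not the paper's implementation: PyCUDA is assumed as the compilation toolchain, and the kernel template, function names, and launch configuration are our own choices.

```python
# Minimal sketch of template-based kernel generation (kernel per individual
# and kernel per generation). PyCUDA is an assumed toolchain; the GP tree is
# presumed already flattened into an infix string such as
# "(v[0] + v[3]) * -v[12]".
import numpy as np
import pycuda.autoinit                      # creates a CUDA context
import pycuda.driver as drv
from pycuda.compiler import SourceModule

KERNEL_TEMPLATE = """
__global__ void evaluate(const float *x, float *out, int n, int n_vars)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        const float *v = x + i * n_vars;    // one row of training data
        out[i] = %s;                        // GP expression pasted in here
    }
}
"""

def compile_individual(expr):
    """Kernel per individual: one compiler invocation per GP tree."""
    return SourceModule(KERNEL_TEMPLATE % expr).get_function("evaluate")

def compile_generation(exprs):
    """Kernel per generation: one source file, one compilation, many kernels."""
    source = "".join(KERNEL_TEMPLATE.replace("evaluate", "evaluate_%d" % k) % e
                     for k, e in enumerate(exprs))
    module = SourceModule(source)
    return [module.get_function("evaluate_%d" % k) for k in range(len(exprs))]

def evaluate_on_gpu(kernel, data):
    """Evaluate one expression over all training rows in parallel."""
    n, n_vars = data.shape
    out = np.empty(n, dtype=np.float32)
    kernel(drv.In(np.ascontiguousarray(data, dtype=np.float32)), drv.Out(out),
           np.int32(n), np.int32(n_vars),
           block=(256, 1, 1), grid=((n + 255) // 256, 1))
    return out
```

The per-generation variant simply concatenates one renamed copy of the template per individual, so the compiler is invoked once per generation instead of once per individual.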
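For the interpreter approach, the sketch below outlines an RPN interpreter kernel in the same vein. The opcode encoding, stack depth, and protected division are illustrative assumptions rather than the exact implementation used in the study; the key property is that this kernel is compiled once and can then evaluate any GP tree passed in as data.

```python
# Minimal sketch of the kernel interpreter: a single CUDA kernel, compiled
# once, that evaluates any GP tree supplied in Reverse Polish Notation as
# two integer arrays (opcodes and operands). Encoding details are assumptions.
import numpy as np
import pycuda.autoinit
import pycuda.driver as drv
from pycuda.compiler import SourceModule

INTERPRETER_SOURCE = """
#define OP_VAR 0       // push variable; operand = variable index
#define OP_ADD 1
#define OP_SUB 2
#define OP_MUL 3
#define OP_DIV 4
#define OP_NEG 5
#define MAX_STACK 16   // ample for trees of depth 4

__global__ void interpret(const int *ops, const int *opnd, int prog_len,
                          const float *x, float *out, int n, int n_vars)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    const float *v = x + i * n_vars;
    float stack[MAX_STACK];   // held in registers, or spilled to local memory
    int sp = 0;
    for (int pc = 0; pc < prog_len; ++pc) {
        float a, b;
        switch (ops[pc]) {
        case OP_VAR: stack[sp++] = v[opnd[pc]]; break;
        case OP_ADD: b = stack[--sp]; a = stack[--sp]; stack[sp++] = a + b; break;
        case OP_SUB: b = stack[--sp]; a = stack[--sp]; stack[sp++] = a - b; break;
        case OP_MUL: b = stack[--sp]; a = stack[--sp]; stack[sp++] = a * b; break;
        case OP_DIV: b = stack[--sp]; a = stack[--sp];        // protected division
                     stack[sp++] = (b == 0.0f) ? 1.0f : a / b; break;
        case OP_NEG: stack[sp - 1] = -stack[sp - 1]; break;
        }
    }
    out[i] = stack[0];   // a well-formed RPN program leaves exactly one value
}
"""

interpret = SourceModule(INTERPRETER_SOURCE).get_function("interpret")

def interpret_on_gpu(ops, opnd, data):
    """Evaluate one RPN-encoded tree over all training rows; no recompilation."""
    n, n_vars = data.shape
    out = np.empty(n, dtype=np.float32)
    interpret(drv.In(np.asarray(ops, dtype=np.int32)),
              drv.In(np.asarray(opnd, dtype=np.int32)), np.int32(len(ops)),
              drv.In(np.ascontiguousarray(data, dtype=np.float32)), drv.Out(out),
              np.int32(n), np.int32(n_vars),
              block=(256, 1, 1), grid=((n + 255) // 256, 1))
    return out
```

The per-thread stack and the per-opcode branching in this loop are precisely the sources of the interpreter overhead discussed in Sect. 4.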

3 Experimental Setup

3.1 Research Questions

Two major drivers of the computational load of GP fitness evaluation are the population size and the training dataset size. Both directly affect the number of fitness evaluations that have to be performed. We formulate our Research Questions around these two factors as follows:

  • RQ1. Which approach performs best against different training dataset sizes?

  • RQ2. Which approach performs best against different population sizes?

Our study uses the Dow Chemical symbolic regression benchmark [10] to investigate the performance of the different parallelisation approaches: it contains a training dataset with 747 data points, each consisting of 57 independent variables and 1 dependent variable. We answer RQ1 by artificially controlling the size of the training dataset and comparing the efficiency of the GPGPU approaches as well as the CPU baseline. Since the aim of our study is not to improve the accuracy of the evolved expression, we simply repeat each data point in the training dataset \(10^0, 10^1, 10^2, 10^3\), and \(10^4\) times to generate datasets of different sizes (see the sketch below). We answer RQ2 by running GP with population sizes of 50, 100, and 200, and comparing the results. Each experiment is repeated 30 times.
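A minimal sketch of this dataset inflation, assuming the benchmark is available as a CSV file (the file name and column layout below are illustrative):

```python
import numpy as np

# 747 rows of 57 inputs + 1 output; file name and layout are assumptions.
raw = np.loadtxt("dow_chemical.csv", delimiter=",")            # shape (747, 58)
datasets = {747 * 10**k: np.repeat(raw, 10**k, axis=0) for k in range(5)}
```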

3.2 Configurations and Environments

We use DEAP to implement the different approaches: the 57 independent variables in the Dow Chemical problem set serve as GP terminal nodes, and the function set includes addition, subtraction, multiplication, division, and negation. The GP uses three-way tournament selection, single-point crossover with a rate of 0.6, and uniform mutation with a rate of 0.01. The tree depth limit is set to 4. The termination criterion is reaching 10 generations. A sketch of this configuration in DEAP follows.
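For concreteness, the configuration above can be expressed in DEAP roughly as follows. The protected division helper and the tree generator choices are our assumptions; the paper specifies only the operators, rates, depth limit, and selection scheme.

```python
import operator
from deap import base, creator, gp, tools

def protected_div(a, b):                     # assumed helper: avoid div-by-zero
    return a / b if b != 0 else 1.0

pset = gp.PrimitiveSet("MAIN", 57)           # 57 independent variables as terminals
pset.addPrimitive(operator.add, 2)
pset.addPrimitive(operator.sub, 2)
pset.addPrimitive(operator.mul, 2)
pset.addPrimitive(protected_div, 2)
pset.addPrimitive(operator.neg, 1)

creator.create("FitnessMin", base.Fitness, weights=(-1.0,))
creator.create("Individual", gp.PrimitiveTree, fitness=creator.FitnessMin)

toolbox = base.Toolbox()
toolbox.register("expr", gp.genHalfAndHalf, pset=pset, min_=1, max_=4)  # depth limit 4
toolbox.register("individual", tools.initIterate, creator.Individual, toolbox.expr)
toolbox.register("population", tools.initRepeat, list, toolbox.individual)
toolbox.register("select", tools.selTournament, tournsize=3)    # three-way tournament
toolbox.register("mate", gp.cxOnePoint)                         # applied at rate 0.6
toolbox.register("expr_mut", gp.genFull, pset=pset, min_=0, max_=2)
toolbox.register("mutate", gp.mutUniform, expr=toolbox.expr_mut, pset=pset)  # rate 0.01
```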

The experiments were conducted on a machine with an Intel i7-6700 CPU and 32 GB of RAM, running Ubuntu 14.04.5 LTS. GPGPU computation was performed on an NVIDIA TITAN X with 12 GB of GDDR5X, using CUDA version 8.0. DEAP was executed with Python 3.4.

Fig. 1. Plot of the three parallelisation approaches and CPU-based GP at the fixed population size of 50. The y-axis is shown on a logarithmic scale.

4 Results

Figure 1 shows the execution time of GP using the three parallelisation approaches as well as the CPU baseline, against five different training dataset sizes at the population size of 50. The lines connecting the boxplots are merely visual aids for identifying the same approach: they connect the mean values of the boxplots. As the training data size increases, the performance of CPU-based GP and the kernel interpreter deteriorates sharply, whereas both the kernel per individual and the kernel per generation approaches show relatively stable performance. However, up to the dataset size of 74,700, the interpreter approach performs best.

The performance deterioration of the interpreter approach is due to the overhead of RPN-based expression evaluation. When the training dataset is relatively small, the savings in kernel compilation time compensate for this overhead. With larger datasets (i.e. more kernel executions), the interpreter overhead cancels out the savings in compilation time.

The interpreter overhead is mainly due to two factors. First, the interpreter uses CUDA registers to maintain a stack, increasing the I/O overhead compared to kernels that hardcode the expression (kernel per individual and kernel per generation). If the stack becomes too large to fit in registers, it spills to slower local memory, increasing the I/O overhead even further. Second, the interpreter makes more internal function calls than the hardcoded kernels, and the increased branching further degrades the performance of the kernel interpreter.

To answer RQ1: the best parallelisation approach is determined by the trade-off between compilation time and the computational overhead of the kernel interpreter. Above a certain number of kernel executions, the interpreter loses the savings from its fewer compilations.

To answer RQ2, we fixed the size of the dataset and varied the population size. The results in Fig. 2a show that, at the dataset size of 74,700, the interpreter method outperforms all other approaches regardless of the population size (i.e. its interpretation overhead is still outweighed by the savings in compilation time). Note that the kernel per individual approach performs worse than the CPU baseline. In Fig. 2b, however, kernel per generation performs best. In fact, the relative order between the approaches is the same as in Fig. 1 at the dataset size of 747,000, regardless of the population size.

Fig. 2. Plots of the three parallelisation approaches and CPU-based GP at the fixed dataset sizes of 74,700 (a) and 747,000 (b). Both y-axes are shown on logarithmic scales.

We note that the kernel per individual approach shows wider variance as the population size grows. Since this approach relies heavily on an external process (i.e. the CUDA compiler), we posit that it is more vulnerable to environmental factors that can affect the execution time stochastically.

5 Related Work

The use of an interpreter has been suggested as a way to scale up GP on GPUs: Langdon and Banzhaf implemented an RPN-based interpreter for GP regression [1]. Wilson and Banzhaf implemented an entire Linear GP system on GPUs, parallelising not only the fitness evaluation but also the GP mutation [11]. Both approaches were developed with earlier incarnations of GPGPU frameworks and do not benefit from the high-level programming support of contemporary frameworks. Our work exploits a modern GPU development framework to compare approaches such as kernel per individual and kernel per generation.

Other applications of GP in SBSE involve evolving not just expressions but arbitrary code [2, 3, 9]. To parallelise GP for these applications, we need to be able to execute arbitrary code on the GPU; the existence of I/O operations or system calls currently prevents such use of GPGPU. However, there is ongoing work that attempts to overcome the limitations of the GPU environment. For example, Silberstein et al. have tried to interface the host file system with GPU kernels [6].

6 Conclusion

This paper evaluates three different parallelisation approaches for GP fitness evaluation on GPUs: kernel per individual, kernel per generation, and kernel interpreter. The empirical study using a symbolic regression benchmark problem shows that, while kernel per generation performs best overall, the actual performance depends on factors such as the population size and the volume of the training data. Consequently, we advise GP practitioners to choose their parallelisation approach with these factors in mind. Future work will investigate a wider range of benchmark problems.