1 Introduction

Search Based Software Engineering (SBSE) is a promising sub-area within Software Engineering that reformulates Software Engineering problems as search-based optimisation problems (Clark et al. 2003; Harman 2007; Harman and Jones 2001). There has been much recent interest in SBSE, with over 800 papers on the topic and several recent surveys (Ali et al. 2010; Afzal et al. 2009; Räihä 2009; Harman et al. 2012).

One weakness shared by many SBSE techniques is their significant execution time. This drawback is inherent: many meta-heuristic optimisation algorithms are designed to evaluate a large number of candidate solutions before arriving at the set of proposed solutions. This process is often computationally demanding and can render an SBSE technique infeasible in practice, because it would simply require too much time to optimise complex real-world Software Engineering problems. Lack of scalability has been shown to be an important barrier to wider uptake of Software Engineering research (Cordy 2003; Chau and Tam 1997; Premkumar and Potter 1995).

Fortunately, many of the algorithms used in SBSE, such as the most widely used evolutionary algorithms (Harman et al. 2012), are classified as ‘embarrassingly parallel’ because of their inherent potential for parallelism. The high computational requirement created by the need to examine many solutions does not have to entail a long elapsed time, because the evaluations can be performed in parallel. The computation process (fitness evaluation) for each candidate solution is identical, making the overall process well suited to Single Instruction Multiple Data (SIMD) parallelism.

The use of multicore computing is rapidly becoming commonplace, with very widely available and inexpensive platforms that offer several hundreds of processing elements that implement SIMD parallelism. Furthermore, it is likely that we shall see significant advances in such platforms in the near future, with thousands and perhaps tens of thousands of simple processing elements becoming available within the reach of ‘standard’ desktop computation.

Many SBSE fitness functions are ideally suited to such simple processing elements. However, there has been little work on multicore SBSE. The first authors to suggest the use of multicore SBSE were Mitchell et al. (2001), who used a distributed architecture to parallelise modularisation through the application of search-based clustering. Subsequently, Mahdavi et al. (2003) used a cluster of standard PCs to implement a parallel hill climbing algorithm. More recently, Asadi et al. (2010) used a distributed architecture to parallelise a genetic algorithm for the concept location problem. However, hitherto, no authors have used General Purpose computing on Graphics Processing Units (GPGPU) for SBSE.

In this paper we propose GPGPU SBSE: the use of GPGPU devices to achieve multicore execution of SBSE, using simple fitness computations mapped across the multiple processing elements of standard GPGPU architectures. We report results for the application of GPGPU SBSE to the multi-objective regression test selection problem (also known as test case minimisation).

In order to apply GPGPU to Software Engineering optimisation problems, the specialized graphics manipulation hardware needs to be harnessed for a purpose for which it was not originally designed. Fortunately, the recent trend towards ‘general purpose’ graphics processing units lends itself well to this task. The underlying firmware typically implements certain matrix manipulation operations very efficiently. The key to unlocking the potential of GPGPU therefore lies in the ability to reformulate the optimisation problem in terms of these specific matrix manipulation operations.

This paper focusses on the problem of Search Based Regression Testing, one problem in the general area of Search Based Software Testing. Regression Testing is concerned with the process of re-testing software after change. After each change to the system, the pool of available test data needs to be re-executed in order to check whether the change has introduced new faults. Regression Testing therefore seeks to answer the question ‘has the software regressed?’. There have been several survey papers on Regression Testing applications and techniques that provide a more detailed treatment (Harrold and Orso 2008; Yoo and Harman 2012a; Engström et al. 2009).

In search based regression testing, the goal is to use search based optimisation algorithms to find optimal sets of test cases (regression test suite minimisation Yoo and Harman 2007) or to order test cases for regression testing (regression test prioritisation Walcott et al. 2006; Li et al. 2007). This paper concentrates upon the former problem of regression test minimisation. Recent results have shown that this is a promising area of SBSE application; the results obtained from the SBSE algorithms have been shown to be human competitive (de Souza et al. 2010).

Fast regression test minimisation is an important problem for practical software testers, particularly where large volumes of testing are required on a tight build schedule. The IBM middleware product used as one of the systems in the empirical study in this paper is a case in point. While it takes over four hours to execute the entire test suite for this system, the typical smoke test scenario performed after each code submission is assigned only an hour or less of testing time, forcing the tester to select a subset of tests from the available pool. A multi-objective approach to test suite minimisation (Yoo and Harman 2007) provides an ideal solution, as it can recommend subsets of tests that can be executed within different time budgets. However, because the selection of smoke tests is not static and depends on the code submitted, the given time budget must account both for the computation involved in test suite minimisation and for running the tests. It is therefore important that test suite minimisation takes only a small fraction of the budget, thereby allowing sophisticated minimisation to be used on standard machines.

The paper modifies three widely used evolutionary algorithms (SPEA2, NSGA-II and the Two Archive algorithm) for the multi-objective regression test minimisation problem. The algorithms are modified to support implementation on a GPU by transforming the fitness evaluation of the population of individual solutions into a matrix multiplication problem, which is inherently parallel and lends itself very favourably to the GPGPU approach. This transformation to matrix multiplication is entirely straightforward and may well be applicable to other SBSE problems, allowing them to benefit from scale-ups similar to those reported in this paper.

The modified algorithms have been implemented using OpenCL technology, a framework for GPGPU. The paper reports the results of applying the parallelised GPGPU algorithms to 13 real-world programs, ranging from the widely studied but relatively small examples of the Siemens suite (Hutchins et al. 1994), through larger, more realistic examples from the Software-artifact Infrastructure Repository (SIR) for testing (Do et al. 2005), to a very large IBM middleware regression testing problem.

The primary contributions of the paper are as follows:

  1.

    The paper presents results for real-world instances of the multi-objective test suite minimisation problem. The results indicate that dramatic speed-up is achievable. For the systems used in the empirical study, speed-ups over 25× were observed. The empirical evidence suggests that, for the larger problems where scale-up is most needed, the degree of speed-up is most dramatic; a problem that takes over an hour using conventional techniques can be solved in minutes using the GPGPU approach. This has important practical ramifications because regression testing cycles are often compressed: overnight build cycles are not uncommon.

  2.

    The paper studies three different multi-objective evolutionary algorithms with both GPU- and CPU-based parallelisation methods to provide robust empirical evidence for the scalability conferred by the use of GPGPU. The GPGPU parallelisation technique maintained the same level of speed-up across all algorithms studied. The empirical evidence highlights the limitations of CPU-based parallelisation: with smaller problems, multi-threading overheads erode the speed-up, whereas with larger problems it fails to scale as well as GPU-based parallelisation.

  3.

    The paper explores the factors that influence the degree of speed-up achieved, revealing that both program size and test suite size are closely correlated with the degree of speed-up. The data fit well a model in which the speed-up achieved grows logarithmically with both program size and test suite size.

The rest of the paper is organised as follows. Section 2 presents background material on test suite minimisation and GPGPU-based evolutionary computation. Section 3 describes how the test suite minimisation problem is re-formulated for a parallel algorithm, which is described in detail in Section 4. Section 5 describes the details of the empirical study, the results of which are analysed in Section 6. Section 7 discusses threats to validity and Section 8 presents the related work. Section 9 concludes.

2 Background

Multi-objective Test Suite Minimisation

The need for test suite minimisation arises when the regression test suite of an existing software system grows to such an extent that it may no longer be feasible to execute the entire test suite (Rothermel et al. 2002b). In order to reduce the size of the test suite, any redundant test cases in the test suite need to be identified and removed. One widely accepted criterion for redundancy is defined in relation to the coverage achieved by test cases (Rothermel et al. 1998; Black et al. 2004). If the test coverage achieved by test case \(t_1\) is a subset of the test coverage achieved by test case \(t_2\), the execution of \(t_1\) is redundant as long as \(t_2\) is also executed. The aim of test suite minimisation is to obtain the smallest subset of test cases that are not redundant with respect to a set of test requirements. More formally, the test suite minimisation problem can be defined as follows (Yoo and Harman 2012a):

2.1 Test Suite Minimisation Problem

Given

A test suite, T, of m test cases; a set of l test requirements, \(\{r_1, \ldots, r_l\}\), that must be satisfied to provide the desired ‘adequate’ testing of the program; and subsets of T, the \(T_i\)s, one associated with each \(r_i\), such that any test case \(t_j\) belonging to \(T_i\) can be used to achieve requirement \(r_i\).

Problem

Find a minimal representative set, T′, of test cases from T that satisfies all of the \(r_i\)s.

The testing criterion is satisfied when every test requirement in \(\{r_1, \ldots, r_l\}\) is satisfied. A test requirement, \(r_i\), is satisfied by any test case, \(t_j\), that belongs to \(T_i\), a subset of T. If we represent test cases as vertices on the left side of a bipartite graph, requirements as vertices on the right side, and the satisfiability relationship as edges between the two sides, a representative set of test cases is a hitting set of the \(T_i\)s (i.e. a subset of vertices on the left, the union of whose connected right-side vertices equals the set of all requirements). Furthermore, in order to maximise the effect of minimisation, T′ should be the minimal hitting set of the \(T_i\)s. The minimal representative set problem is NP-complete, as is its dual, the minimal set cover problem (Garey and Johnson 1979).
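The redundancy criterion and the representative set check can be illustrated with a small sketch (our own toy example in Java; the class and method names are illustrative and do not come from the paper's implementation):

```java
import java.util.*;

public class Redundancy {
    // t1 is redundant with respect to t2 if coverage(t1) is a subset of coverage(t2)
    static boolean redundant(Set<Integer> covT1, Set<Integer> covT2) {
        return covT2.containsAll(covT1);
    }

    // A subset of tests is a representative set (a hitting set of the T_i's)
    // if the union of the requirements its tests cover equals all requirements.
    static boolean satisfiesAll(List<Set<Integer>> coverage,
                                Set<Integer> selected, int numRequirements) {
        Set<Integer> covered = new HashSet<>();
        for (int t : selected) covered.addAll(coverage.get(t));
        return covered.size() == numRequirements;
    }

    public static void main(String[] args) {
        // Three tests covering requirements {0}, {0,1} and {2} respectively
        List<Set<Integer>> cov = Arrays.asList(
            new HashSet<>(Arrays.asList(0)),
            new HashSet<>(Arrays.asList(0, 1)),
            new HashSet<>(Arrays.asList(2)));
        System.out.println(redundant(cov.get(0), cov.get(1)));  // first test is redundant
        System.out.println(satisfiesAll(cov, new HashSet<>(Arrays.asList(1, 2)), 3));
    }
}
```

In this toy instance the second and third tests form a minimal representative set: the first test covers no requirement that the second does not already cover.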

The NP-hardness of the problem encouraged the use of heuristics and meta-heuristics. The greedy approach (Offutt et al. 1995), as well as other heuristics for the minimal hitting set and set cover problems (Harrold et al. 1993; Chen and Lau 1995), has been applied to test suite minimisation, but these approaches were not cost-cognisant and dealt with only a single objective (test coverage). With the single-objective problem formulation, the solution to the test suite minimisation problem is a single subset of test cases that maximises test coverage with minimum redundancy.

Later, the problem was reformulated as a multi-objective optimisation problem (Yoo and Harman 2007). With the multi-objective problem formulation, the solution to the test suite minimisation problem is not just a single solution but a set of non-dominated, Pareto-efficient solutions. This set of solutions reveals the trade-off between test coverage and the cost of testing that is specific to the test suite in consideration. For example, with the solution to the multi-objective test suite minimisation problem, it is possible not only to know what the minimal subset that achieves the maximum test coverage is, but also to know how much test coverage is possible for any given testing budget.

Since the greedy algorithm may not always produce Pareto optimal solutions for multi-objective test suite minimisation problems, Multi-Objective Evolutionary Algorithms (MOEAs) have been applied (Maia et al. 2009; Yoo and Harman 2007). While this paper studies three selected MOEAs, the principle of parallelising fitness evaluation of multiple solutions in the population of an MOEA applies universally to any MOEA.

GPGPU and Evolutionary Algorithms

Graphics cards have become a compelling platform for intensive computation, with a set of resource-hungry graphics manipulation problems that have driven rapid advances in their performance and programmability (Owens et al. 2007). As a result, consumer-level graphics cards boast tremendous memory bandwidth and computational power. For example, the ATI Radeon HD4850 (the graphics card used in the empirical study in the paper), costing about $150 as of April 2010, provides a processing rate of 1,000 GFLOPS and 63.6 GB/s of memory bandwidth. Graphics cards are also improving in performance more rapidly than CPUs. It has been reported that the computational capabilities of graphics cards, measured by metrics of graphics performance, have compounded at an average yearly rate of 1.7× (rendered pixels/s) to 2.3× (rendered vertices/s) (Owens et al. 2007). This significantly outpaces the growth of traditional microprocessors; using the SPEC benchmark, the yearly growth rate of CPU performance has been measured at 1.4× by a recent survey (Ekman et al. 2005).

The disparity between the two platforms stems from their different architectures: CPUs are optimised for executing sequential code, whereas GPUs are optimised for executing the same instruction (the graphics shader) with data parallelism (different objects on the screen). This Single Instruction Multiple Data (SIMD) architecture facilitates hardware-controlled massive data parallelism, which yields higher performance for problems in which a large dataset must be subjected to the same operations.

It is precisely this massive data parallelism of General Purpose computing on Graphics Processing Units (GPGPU) that makes GPGPU an ideal platform for parallel evolutionary algorithms. Many of these algorithms require the calculation of fitness (single instruction) for multiple individual solutions in the population pool (multiple data). Early work has exploited this potential for parallelism with both single- and multi-objective evolutionary algorithms (Tsutsui and Fujimoto 2009; Wilson and Banzhaf 2009; Wong 2009). However, most existing evaluations have been performed on benchmark problems rather than practical applications.

3 Parallel Formulation of MOEA for Test Suite Minimisation

Parallel Fitness Evaluation

The paper considers, for parallelisation, a multi-objective test suite minimisation problem from existing work (Yoo and Harman 2007). In order to parallelise test suite minimisation, the fitness evaluation of a generation of individual solutions is re-formulated as a matrix multiplication problem. Instead of computing the two objectives (i.e. coverage of test requirements and execution cost) for each individual solution in turn, the solutions in the entire population are represented as a matrix, which is multiplied by another matrix that represents the trace data of the entire test suite. The result is a matrix that contains the information needed for both test goal coverage and execution cost. While the paper considers structural coverage as the test goal, the proposed approach is equally applicable to other testing criteria, whether coverage-based, such as data-flow coverage and functional coverage, or even manually generated, provided that there is a clear mapping between tests and the test requirements they achieve.

More formally, let matrix A contain the trace data that capture the test requirements achieved by each test: the number of rows of A equals the number of test requirements to be covered, l, and the number of columns of A equals the number of test cases in the test suite, m. Entry \(a_{i,j}\) of A stores 1 if the test goal \(f_i\) was executed (i.e. covered) by test case \(t_j\), and 0 otherwise.

$$ A = \left(\begin{array}{ccc} a_{1,1} & \ldots & a_{1,m}\\ a_{2,1} & \ldots & a_{2,m}\\ & \ldots & \\ a_{l,1} & \ldots & a_{l,m}\\ \end{array}\right) $$

The multiplier matrix, B, is a representation of the current population of individual solutions that are being considered by a given MOEA. Let B be an m-by-n matrix, where n is the population size of the given MOEA. Entry \(b_{j,k}\) of B stores 1 if test case \(t_j\) is selected by individual \(p_k\), and 0 otherwise. In other words, each column of matrix B corresponds to a vector of decision variables that denote the selected test cases.

$$ B = \left(\begin{array}{ccc} b_{1,1} & \ldots & b_{1,n}\\ b_{2,1} & \ldots & b_{2,n}\\ & \ldots & \\ b_{m,1} & \ldots & b_{m,n}\\ \end{array}\right) $$

The fitness evaluation of the entire generation is performed by the matrix multiplication C = A × B. Matrix C is an l-by-n matrix; entry \(c_{i,k}\) of C denotes the number of times test goal \(f_i\) was covered by the different test cases selected by individual \(p_k\).
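For illustration, the reformulation can be sketched sequentially as follows (a minimal Java sketch with a hypothetical 3-requirement, 4-test instance; it is not the paper's OpenCL implementation):

```java
import java.util.Arrays;

public class FitnessMatrix {
    // C = A x B: A is the l-by-m trace matrix, B is the m-by-n population matrix.
    static int[][] multiply(int[][] a, int[][] b) {
        int l = a.length, m = b.length, n = b[0].length;
        int[][] c = new int[l][n];
        for (int i = 0; i < l; i++)
            for (int k = 0; k < n; k++)
                for (int j = 0; j < m; j++)
                    c[i][k] += a[i][j] * b[j][k];
        return c;
    }

    public static void main(String[] args) {
        // a[i][j] = 1 iff test t_j covers test goal f_i (hypothetical data)
        int[][] a = {{1, 0, 1, 0},
                     {0, 1, 1, 0},
                     {0, 0, 0, 1}};
        // Two candidate solutions as columns: {t_1, t_3} and {t_2, t_4}
        int[][] b = {{1, 0}, {0, 1}, {1, 0}, {0, 1}};
        // c[i][k] counts how many tests selected by p_k cover f_i
        System.out.println(Arrays.deepToString(multiply(a, b)));
        // prints [[2, 0], [1, 1], [0, 1]]
    }
}
```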

Cost and Coverage

In order to incorporate the execution cost as an additional objective for the MOEA, the basic reformulation is extended with an extra row in matrix A. The new matrix, A′, is an (l + 1)-by-m matrix whose last row contains the cost of each individual test case. The extra row in A′ results in an additional row in C′ = A′ × B, as follows:

$$A' = \left(\begin{array}{ccc} a_{1,1} & \ldots & a_{1,m}\\ a_{2,1} & \ldots & a_{2,m}\\ & \ldots & \\ a_{l,1} & \ldots & a_{l,m}\\ \mathrm{cost}(t_1) & \ldots & \mathrm{cost}(t_m)\\ \end{array}\right) \qquad C' = \left(\begin{array}{ccc} c_{1,1} & \ldots & c_{1,n}\\ c_{2,1} & \ldots & c_{2,n}\\ & \ldots & \\ c_{l,1} & \ldots & c_{l,n}\\ \mathrm{cost}(p_1) & \ldots & \mathrm{cost}(p_n)\\ \end{array}\right) $$

By definition, an entry \(c_{l+1,k}\) in the last row of C′ is defined as \(c_{l+1,k} = \sum_{j=1}^{m}{a_{l+1,j} \cdot b_{j,k}} = \sum_{j=1}^{m}{\mathrm{cost}(t_j) \cdot b_{j,k}}\). That is, \(c_{l+1,k}\) equals the sum of the costs of all test cases selected (\(b_{j,k} = 1\)) by the k-th individual solution \(p_k\), i.e. \(\mathrm{cost}(p_k)\). Similarly, after the multiplication, the k-th column of matrix C′ contains the coverage of test requirements achieved by individual solution \(p_k\). However, this information needs to be summarised into a percentage coverage, using a step function f, as follows: \(\mathrm{coverage}(p_k) = \frac{\sum_{i=1}^{l}{f(c_{i,k})}}{l}\), where f(x) = 1 if x > 0 and f(x) = 0 otherwise. The role of the step function is to translate the linear sum of how many times a test goal has been covered into the boolean fact of whether it was covered or not.

The cost objective can be calculated as a part of the matrix multiplication because it can be computed linearly from the decision variable vectors (the columns of matrix B). Other objectives that share this linearity may also be computed by matrix multiplication. However, the coverage of test requirements requires a separate step: each column of C′ contains the number of times individual test requirements were covered by the corresponding solution, so calculating the coverage metric for a solution requires iterating over the corresponding column of C′. The coverage calculation is highly parallel in nature because each column can be iterated over independently, and it can therefore take advantage of the GPGPU architecture by running multiple threads.
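The extraction of the two objectives from C′ can be sketched in Java as follows (the class and method names are ours, and the matrix contents are a hypothetical example; the cost row and step function follow the definitions in this section):

```java
public class CostCoverage {
    // C' is (l+1)-by-n; rows 0..l-1 are coverage counts, the last row is cost.
    static double coverage(int[][] cPrime, int k) {
        int l = cPrime.length - 1;           // exclude the cost row
        int covered = 0;
        for (int i = 0; i < l; i++)
            if (cPrime[i][k] > 0) covered++; // step function f(x)
        return (double) covered / l;
    }

    static int cost(int[][] cPrime, int k) {
        return cPrime[cPrime.length - 1][k]; // already summed by the multiplication
    }

    public static void main(String[] args) {
        // Hypothetical C' for l = 3 test goals and n = 2 solutions
        int[][] cPrime = {{2, 0},
                          {1, 1},
                          {0, 1},
                          {7, 5}};           // cost(p_1) = 7, cost(p_2) = 5
        System.out.println(coverage(cPrime, 0)); // p_1 covers 2 of 3 goals
        System.out.println(cost(cPrime, 1));
    }
}
```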

4 Algorithms

This section presents the parallel fitness evaluation components for CPU and GPU and introduces the MOEAs that are used in the paper.

Parallel Matrix Multiplication Algorithm

Matrix multiplication is inherently parallelisable as the calculation for an individual entry of the product matrix does not depend on the calculation of any other entry. Algorithm 1 shows the pseudo-code of the parallel matrix multiplication algorithm using the matrix notation in Section 3.

Algorithm 1 uses one thread per element of matrix C′, resulting in a total of (l + 1)·n threads. Each thread is identified by a unique thread id, tid. Given a thread id and the width of matrix B, \(w_B\) (which is also the width of C′), Algorithm 1 calculates the corresponding element of the resulting matrix, \(C^\prime_{y,x}\), where \(y = \lfloor tid / w_B \rfloor\) and \(x = tid \bmod w_B\).
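The tid-to-element mapping of Algorithm 1 can be mimicked on the CPU with Java's parallel streams (a sketch of the scheme only, not the paper's OpenCL kernel):

```java
import java.util.stream.IntStream;

public class ParallelMultiply {
    // One logical thread per element of the product matrix: tid -> (y, x),
    // where wB is the width of B (and of the product matrix).
    static int[][] multiply(int[][] a, int[][] b) {
        int rows = a.length, wB = b[0].length, m = b.length;
        int[][] c = new int[rows][wB];
        IntStream.range(0, rows * wB).parallel().forEach(tid -> {
            int y = tid / wB;          // row of the element this 'thread' computes
            int x = tid % wB;          // column of the element
            int sum = 0;
            for (int j = 0; j < m; j++) sum += a[y][j] * b[j][x];
            c[y][x] = sum;             // distinct element per tid: no data races
        });
        return c;
    }
}
```

Because every tid writes to a distinct element of the product, no synchronisation between the logical threads is needed, which is what makes the computation embarrassingly parallel on SIMD hardware.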

Coverage Collection Algorithm

After the matrix multiplication using Algorithm 1, coverage information is collected using a separate algorithm, the pseudo-code for which is shown in Algorithm 2. Unlike Algorithm 1, the coverage collection algorithm requires only n threads, i.e. one thread per column of matrix C′.

The loop in Lines (2) and (3) counts the number of structural elements that have been executed by the individual solution \(p_{tid}\). The coverage is calculated by dividing this count by the total number of structural elements that need to be covered.

While coverage information requires a separate collection phase, the sum of the costs for each individual solution has already been calculated by Algorithm 1 as part of the matrix multiplication, following the extension described in Section 3.

5 Experimental Setup

5.1 Research Questions

This section presents the research questions studied in the paper. RQ1 and RQ2 concern the scalability achieved by the speed-up through the use of GPGPU, whereas RQ3 concerns the practical implications of the speed-up, and of the consequent scalability, for practitioners.

RQ1. Speed-up

What is the speed-up factor of the GPU- and CPU-based parallel versions of the MOEAs over the untreated CPU-based versions of the same algorithms for the multi-objective test suite minimisation problem?

RQ2. Correlation

What factors of the problem instances correlate most strongly with the speed-up achieved, and what is the correlation between these factors and the resulting speed-up?

RQ3. Insight

What are the realistic benefits, to software engineers, of the scalability achieved by the GPGPU approach?

RQ1 is answered by observing the dynamic execution time of the parallel versions of the studied algorithms as well as that of the untreated single-threaded algorithms. For RQ2, two factors constitute the size of a test suite minimisation problem: the number of test cases in the test suite and the number of test requirements in the System Under Test (SUT) that need to be covered. The speed-up values measured for RQ1 are statistically analysed to investigate the correlation between the speed-up and these two size factors. RQ3 is answered by analysing the result of test suite minimisation obtained for a real-world testing problem.

5.2 Subjects

Table 1 shows the subject programs for the empirical study. Twelve of the programs and test suites are from the Software-artifact Infrastructure Repository (SIR) (Do et al. 2005). In order to obtain test suites of varying sizes, ranging from a few hundred to a few thousand test cases, the study includes multiple test suites for some subject programs. For printtokens and schedule, the smaller test suites are coverage-adequate test suites, whereas the larger test suites include all the available test cases. To avoid selection bias, four smaller test suites were randomly selected from the pool of available tests for each program. In the case of space, SIR contains multiple coverage-adequate test suites of similar sizes; four test suites were selected randomly.

Table 1 Subject programs used for the empirical study

The subjects also include a large system-level test suite from IBM. For this subject, the coverage information was maintained at the function level. The test suite contains only 181 test cases, but these test cases are used to cover 61,770 functions in the system.

Each test suite has an associated execution cost dataset. For the subject programs from SIR, the execution costs were measured by observing the number of instructions required by the execution of tests. This was performed using a well-known profiling tool, valgrind (Nethercote and Seward 2007), which executes the given program on a virtual processor. For ibm, physical wall-clock time data, measured in seconds, were provided by IBM. The entire test suite for ibm takes more than 4 h to execute.

5.3 Implementation and Hardware

Implementation

The paper uses the open source Java MOEA library, jMetal (Durillo et al. 2006, 2010), as the source of the untreated versions of the MOEAs: NSGA-II and SPEA2 are included in the jMetal library, while the Two Archive algorithm has been implemented using the infrastructure provided by the library. The untreated versions of the MOEAs evaluate the fitness of individual solutions in the population one at a time, which incurs method invocation overheads for the retrieval of coverage and cost information.

The GPGPU-based parallel versions of these three algorithms are implemented in the OpenCL GPGPU framework using a Java wrapper called JavaCL (Chafik 2009). The CPU-based parallel versions of the three algorithms use a parallel programming library for Java called JOMP (Bull et al. 2000). JOMP allows parameterised configuration of the number of threads to use. The parallelisation is only applied to the fitness evaluation step because it is not clear whether certain steps in the studied algorithms, such as sorting, may yield sufficient efficiency when performed in parallel.

All three algorithms are configured with a population size of 256, following the standard recommendation to set the number of threads to a multiple of 32 or 64 (ATI Stream Computing 2010). The archive size for SPEA2 and the Two Archive algorithm is also set to 256. The stopping criterion for all three algorithms is reaching the maximum number of fitness evaluations, which is set to 64,000, allowing 250 generations to be evaluated.

All three algorithms solve the test suite minimisation problem by selecting Pareto-optimal subsets of test cases, represented by binary strings that form columns in matrix B in Section 3. The initial population is generated by randomly setting the individual bits of these binary strings so that the initial solutions are randomly distributed in the phenotype space.

NSGA-II and SPEA2 use the binary tournament selection operator. The Two Archive algorithm uses the uniform selection operator described in the original paper (Praditwong and Yao 2006): the selection operator first selects one of the two archives with equal probability and then selects one solution from the chosen archive with uniform probability. All three algorithms use the single-point crossover operator with the probability of crossover set to 0.9, and the single bit-flip mutation operator with a mutation rate of \(\frac{1}{n}\), where n is the length of the bit-string (i.e. the number of test cases in the test suite).
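The variation operators described above can be sketched as follows (a generic Java sketch of single-point crossover and bit-flip mutation over the binary strings that form the columns of matrix B; it is not jMetal's actual implementation):

```java
import java.util.Random;

public class Operators {
    // Single-point crossover: exchange the suffixes of two parents at a random cut.
    static boolean[][] crossover(boolean[] p1, boolean[] p2, Random rnd) {
        int point = 1 + rnd.nextInt(p1.length - 1);  // cut strictly inside the string
        boolean[] c1 = new boolean[p1.length], c2 = new boolean[p1.length];
        for (int i = 0; i < p1.length; i++) {
            c1[i] = i < point ? p1[i] : p2[i];
            c2[i] = i < point ? p2[i] : p1[i];
        }
        return new boolean[][] {c1, c2};
    }

    // Bit-flip mutation with rate 1/n: one expected flip per offspring.
    static void mutate(boolean[] s, Random rnd) {
        double rate = 1.0 / s.length;
        for (int i = 0; i < s.length; i++)
            if (rnd.nextDouble() < rate) s[i] = !s[i];
    }
}
```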

Hardware

All algorithms have been evaluated on a machine with a quad-core Intel Core i7 CPU (2.8 GHz clock speed) and 4 GB memory, running Mac OS X 10.6.5 with Darwin Kernel 10.6.0 for x86_64 architecture. The Java Virtual Machine used to execute the algorithms is Java SE Runtime with version 1.6.0_22. The GPGPU-based versions of MOEAs have been evaluated on an ATI Radeon HD4850 graphics card with 800 stream processors running at 625 MHz clock speed and 512 MB GDDR3 onboard memory.

5.4 Evaluation

The paper compares three MOEAs, each with five different configurations: the untreated configuration (hereafter referred to as CPU), the GPGPU configuration (GPU) and the JOMP-based parallel configurations with 1, 2 and 4 threads (JOMP1/2/4). The configuration with one thread (JOMP1) is included to observe the speed-up achieved by evaluating the fitness of the entire population using matrix multiplication, instead of evaluating the solutions one by one as in the untreated versions of the MOEAs. Any speed-up achieved by JOMP1 over CPU is, therefore, primarily achieved by the code-level optimisation that removes the method invocation overheads. On the other hand, JOMP1 does incur an additional thread management overhead.

In total, there are 15 different configurations (three algorithms with five configurations each). For each subject test suite, the 15 configurations were executed 30 times in order to cater for the inherent randomness in dynamic execution time, measured using the system clock. The speed-up is calculated by dividing the execution time required by the CPU configuration by the execution time required by each parallel configuration. While we discuss the speed-up values using only the total execution time, we also provide observations of the three-part breakdown of the total execution time (Time total) for each algorithm, as follows:

  • Initialisation (Time init): the time it takes for the algorithm to initialise the test suite data in a usable form; for example, GPU configurations of MOEAs need to transfer the test suite data onto the graphics card.

  • Fitness Evaluation (Time fitness): the time it takes for the algorithm to evaluate the fitness values of different generations during its runtime.

  • Remaining (Time remaining): the remaining parts of the execution time, most of which is used for archive management, genetic operations, etc.

6 Results

This section presents the speed-up measurements of the single-threaded, CPU-based multi-threaded, and GPGPU-based multi-threaded approaches and analyses the correlation between the speed-up and problem size.

6.1 Speed-up

Figure 1 presents the mean paired speed-up results of all configurations. The mean paired speed-up values were calculated by dividing the execution time of CPU by the corresponding execution time of the parallel configurations for each of the 30 observations. Tables 2, 3 and 4 contain the speed-up data in more detail, whereas the statistical analysis of the raw information can be found in Tables 12, 13 and 14 in the Appendix.

Fig. 1 Mean paired speed-ups achieved by different algorithms and parallel configurations

Table 2 Speed-up results for NSGA-II
Table 3 Speed-up results for SPEA2
Table 4 Speed-up results for TAEA

Overall, the observed paired mean speed-up ranges from 0.47× to 25.09×. While the different archive management strategies used by each MOEA make it difficult to compare the execution time results directly (because the different amounts of heap used by each may affect the JVM’s performance differently), it is possible to observe the general trend that the speed-up tends to increase as the problem size grows. The speed-up values below 1.0 show that the overheads of thread management and additional communication can be detrimental for problems of sufficiently small size. However, as the problem size grows, JOMP1 becomes faster than CPU for all algorithms, indicating that the reduction in method call overhead eventually becomes greater than the thread management overhead.

With the largest dataset, ibm, the GPU configuration of NSGA-II reduces the average execution time of CPU, 4,347 s (1 h 12 min and 27 s), to the average of 174 s (2 min and 54 s). The speed-up remains consistently above 3.0× for all three algorithms if the problem size is larger than that of flex, i.e. about 400,000 (103 tests × 3,965 test requirements).

To provide inferential statistical analysis, the observed execution time data have been compared using the Mann-Whitney ‘U’ test, a non-parametric statistical hypothesis test that allows the comparison of two samples with unknown distributions. The execution time data observed with the JOMP1/2/4 and GPU configurations were compared to those from the CPU configuration. The null hypothesis is that there is no difference between the parallel configurations and the CPU configuration; the alternative hypothesis is that the execution time of the parallel configurations is smaller than that of the CPU configuration.
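The hypothesis test described above can be sketched as follows. This is a minimal, stdlib-only illustration of the one-sided Mann-Whitney U test using the normal approximation and no tie correction (a library routine such as scipy.stats.mannwhitneyu would normally be used); the timing data are hypothetical, not the paper's measurements.

```python
# Minimal one-sided Mann-Whitney U test (normal approximation, no tie
# correction), testing whether sample a tends to be smaller than sample b.
import math

def mann_whitney_one_sided(a, b):
    """P-value for the alternative hypothesis 'a tends to be smaller than b'."""
    combined = sorted((v, i) for i, v in enumerate(a + b))
    ranks = [0.0] * len(combined)
    for rank, (_, idx) in enumerate(combined, start=1):
        ranks[idx] = float(rank)
    r1 = sum(ranks[: len(a)])              # rank sum of sample a
    n1, n2 = len(a), len(b)
    u1 = r1 - n1 * (n1 + 1) / 2.0          # U statistic for sample a
    mu = n1 * n2 / 2.0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u1 - mu) / sigma
    return 0.5 * math.erfc(-z / math.sqrt(2.0))  # P(Z <= z)

gpu_times = [170.0, 172.0, 174.0, 176.0, 178.0]       # hypothetical data
cpu_times = [4300.0, 4320.0, 4340.0, 4360.0, 4380.0]  # hypothetical data
p = mann_whitney_one_sided(gpu_times, cpu_times)
print(p < 0.05)  # True: reject the null at the 95 % confidence level
```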

Tables 5, 6 and 7 present the resulting p-values. With the JOMP1 and JOMP2 configurations, the alternative hypothesis is rejected at the 95 % confidence level for 39 and 12, respectively, of the 42 cases with problem sizes smaller than that of flex, providing evidence that these parallel configurations required more time than the untreated configuration (CPU). With the JOMP4 and GPU configurations, the null hypothesis is universally rejected for all subjects with problem sizes larger than that of flex, providing strong evidence that these parallel configurations required less time than the untreated configuration (CPU). The particular results naturally depend on the choice of graphics card used for the experiment. However, taken together, these results provide strong evidence that, for test suite minimisation problems of realistic sizes, the GPGPU approach can provide a speed-up of at least 3.0×. This finding answers RQ1.

Table 5 Mann-Whitney U test for NSGA-II
Table 6 Mann-Whitney U test for SPEA2
Table 7 Mann-Whitney U test for TAEA

6.2 Correlation

Regarding RQ2, one important factor that contributes to the level of speed-up is the speed of each individual computational unit in the graphics card. The HD4850 graphics card used in the experiment contains 800 stream processor units that are normally used for the computation of geometric shading. Each of these processors executes one of the threads of Algorithm 1, of which there exist more than 800. Therefore, if an individual stream processor were as powerful as a single core of the CPU, the absolute upper bound on the speed-up would be 800. In practice, the individual processors run at a clock speed of 625 MHz, which makes them much slower and, therefore, less powerful than a CPU core. This results in speed-up values well below 800.
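The bound above can be illustrated with a back-of-the-envelope calculation. This is purely illustrative: the CPU clock speed used here is a hypothetical assumption, not a figure from the paper, and clock speed is of course not the only difference between a stream processor and a CPU core.

```python
# Back-of-the-envelope bound: 800 stream processors give an absolute
# upper bound of 800x, but each runs at only 625 MHz. Assuming
# (hypothetically) a 2.5 GHz CPU core, and pretending clock speed were
# the only difference, the bound shrinks proportionally.
streams = 800
gpu_clock_mhz = 625.0
cpu_clock_mhz = 2500.0   # hypothetical CPU clock, not taken from the paper

naive_bound = streams                                  # one stream = one core
clock_adjusted = streams * gpu_clock_mhz / cpu_clock_mhz
print(naive_bound, clock_adjusted)  # 800 200.0
```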

In order to answer RQ2, statistical regression analysis was performed on the correlation between the observed speed-up and the factors that constitute the size of problems.

Three size factors have been analysed for the statistical regression: the size of the test goal set, the size of the test suite, and their product. The number of test requirements and the number of test cases are denoted by l and m respectively, following the matrix notation in Section 3: l is proportional to the number of threads the GPGPU version of the algorithm has to execute (as the size of the matrix C′ is l-by-n and n is fixed); m denotes the amount of computation that needs to be performed by a single thread (as each matrix-multiplication kernel computes a loop with m iterations). In addition to these measurements, another size factor z = l · m is considered to represent the perceived size of the minimisation problem. Table 8 shows the results of Spearman’s rank correlation analysis between the size factors and the observed speed-ups.

Table 8 Spearman’s rank correlation coefficients between three size factors and speed-ups

Spearman’s rank correlation is a non-parametric measure of how well the relationship between two variables can be described using a monotonic function. As one variable increases, the other variable will tend to increase monotonically if the coefficient is close to 1, whereas it would decrease monotonically if the coefficient is close to −1.
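The rank correlation described above can be sketched as follows; this is a stdlib-only illustration using the classical no-ties formula ρ = 1 − 6·Σd²/(n(n²−1)), with hypothetical data rather than the paper's measurements (a library routine such as scipy.stats.spearmanr would normally be used).

```python
# Stdlib-only sketch of Spearman's rank correlation: rank both variables,
# then apply rho = 1 - 6*sum(d^2) / (n*(n^2 - 1)), valid when there are
# no ties. Data are illustrative.

def ranks(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman_rho(x, y):
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))

# A monotonically increasing relationship gives +1; decreasing gives -1.
l_values = [100, 500, 2000, 4000, 10000]   # hypothetical problem sizes
speedups = [0.5, 1.2, 3.1, 7.8, 25.0]      # hypothetical speed-ups
print(spearman_rho(l_values, speedups))        # 1.0
print(spearman_rho(l_values, speedups[::-1]))  # -1.0
```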

In all algorithms and configurations, the size factor l shows the strongest positive correlation with speed-ups. The correlation coefficients for z are weaker than those for l or, in the case of JOMP1 for the Two Archive algorithm, negative. The correlation for m remains negative for all algorithms and configurations.

To gain further insights into the correlation between size and the speed-up observed, a regression analysis was performed. Factor z is considered in isolation, whereas l and m are considered together; each variable has been considered in its linear form (z, l and m) and logarithmic form (logz, logl and logm). This results in 6 different combinations of regression models. Tables 9, 10 and 11 in the Appendix present the detailed results of regression analysis for the three algorithms respectively.

With a few exceptions of very small margins (NSGA-II with JOMP4 and SPEA2 with JOMP1, JOMP4, and GPU), the model with the highest r² correlation for all algorithms and configurations is S_p = α log l + β log m + γ. Figure 2 shows the 3D plot of this model for the GPU and JOMP4 configurations of the Two Archive algorithm.
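Fitting this model can be sketched as follows. This is a stdlib-only illustration of ordinary least squares for S_p = α log l + β log m + γ via the normal equations (in practice a library routine such as numpy.linalg.lstsq would be used); the data are synthetic, generated from known coefficients and then recovered, not the paper's measurements.

```python
# OLS fit of S_p = a*log(l) + b*log(m) + c via the normal equations,
# solved with a small Gauss-Jordan elimination. Synthetic data only.
import math

def solve3(A, y):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    M = [row[:] + [y[i]] for i, row in enumerate(A)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(3):
            if r != col:
                f = M[r][col] / M[col][col]
                M[r] = [a - f * b for a, b in zip(M[r], M[col])]
    return [M[i][3] / M[i][i] for i in range(3)]

def fit(ls, ms, sps):
    """Least-squares coefficients (alpha, beta, gamma) of the model."""
    rows = [[math.log(l), math.log(m), 1.0] for l, m in zip(ls, ms)]
    AtA = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    Aty = [sum(r[i] * s for r, s in zip(rows, sps)) for i in range(3)]
    return solve3(AtA, Aty)

# Synthetic speed-ups generated from alpha=2.0, beta=-0.5, gamma=1.0:
ls = [100, 500, 2000, 4000, 10000, 30000]
ms = [50, 80, 120, 160, 200, 500]
sps = [2.0 * math.log(l) - 0.5 * math.log(m) + 1.0 for l, m in zip(ls, ms)]
alpha, beta, gamma = fit(ls, ms, sps)
print(round(alpha, 3), round(beta, 3), round(gamma, 3))  # recovers 2.0 -0.5 1.0
```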

Fig. 2 3D plot of the regression model S_p = α log l + β log m + γ for the GPU (solid line) and JOMP4 (dotted line) configurations of the Two Archive algorithm

The observed trend is that the inclusion of log l results in higher correlation, whereas models that use l in its linear form tend to result in lower correlation. This supports the results of the Spearman’s rank correlation analysis in Table 8. The coefficients of the best-fit regression model for GPU, S_p = α log l + β log m + γ, can explain why the speed-up results for the space test suite are higher than those for test suites with similar z values, such as tcas, gzip and replace. Apart from bash and ibm, space has the largest number of test requirements to cover, i.e. l. Since α is more than twice as large as β, a higher value of l has a greater impact on S_p than a higher value of m.

Based on the analysis, RQ2 is answered as follows: the observed speed-up shows a strong linear correlation with the logarithm of the number of test requirements to cover and the logarithm of the number of test cases in the test suite. The positive correlation provides evidence that GPU-based parallelisation scales up.

Furthermore, within the observed data, the speed-up continues to increase as the problem size grows, which suggests that the graphics card did not reach its full computational capacity. It may be that, for larger problems, the speed-up would be even greater than those observed in this paper. The finding that the scalability factor increases with overall problem size is a very encouraging finding; as the problem gets harder, the solution gets better.

6.3 Insights

This section discusses a possible real-world scenario in which the parallelisation of multi-objective test suite minimisation can have a high impact. A smoke test is a testing activity that is usually performed in a very short window of time to detect the most obvious faults, such as system crashes. IBM’s smoke test practice is to allow from 30 to 60 min of time to execute a subset of tests from a large test suite that would require more than 4 h to execute in its entirety.

Using a static smoke test suite is problematic, as running the same tests in every regression cycle greatly reduces the likelihood of finding bugs. It is therefore important to recalculate the most relevant smoke test suite given the changes to the code. It is for this reason that the cost of computation, especially the actual time it takes, becomes very important.

Figure 3 shows two possible smoke test scenarios based on the results of the CPU and GPGPU configurations of NSGA-II. It plots how much test requirement coverage can be achieved within a given time, including the time needed for multi-objective test suite minimisation. The solid line represents the scenario based on the GPGPU configuration of the algorithm, whereas the dotted line represents the scenario based on the CPU configuration. The flat segment at the bottom left shows the time each configuration spends on the optimisation process; the curved segment shows the trade-off between time and test coverage achieved by the optimised test suite. Since the CPU configuration of NSGA-II takes longer than 60 min to terminate, it cannot contribute to any smoke test scenario that must be completed within 60 min. The GPGPU configuration, on the other hand, allows the tester to consider a subset of tests that can be executed within 30 min. Had the grey region in Fig. 3 been wider, the difference between the two configurations would have been even more dramatic.
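The trade-off above can be sketched as follows. This is a hypothetical illustration (the function, test durations, and coverage gains are invented for the example, not taken from the paper): the smoke-test time budget must first absorb the optimisation time itself, and only the remainder is available for executing the minimised suite.

```python
# Sketch of the smoke-test trade-off: the optimisation time is deducted
# from the budget before any tests can run. All numbers are illustrative.

def coverage_within_budget(opt_minutes, tests, budget_minutes):
    """tests: list of (execution_minutes, added_coverage) pairs, in the
    order in which the minimised suite schedules them."""
    remaining = budget_minutes - opt_minutes
    covered = 0.0
    for cost, gain in tests:
        if remaining < cost:
            break
        remaining -= cost
        covered += gain
    return covered

# Hypothetical minimised suite; GPU optimisation finishes in ~3 minutes,
# while the CPU configuration needs over an hour.
suite = [(5, 40.0), (10, 25.0), (10, 15.0), (15, 10.0)]
print(coverage_within_budget(3, suite, 30))   # 80.0: GPU leaves time to test
print(coverage_within_budget(72, suite, 60))  # 0.0: CPU exhausts the budget
```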

Fig. 3 Comparison of smoke test scenarios for the IBM system (ibm). The solid line shows the trade-off between time and test coverage when the GPU configuration of NSGA-II is used, whereas the dotted line shows that of the CPU configuration. The grey area shows the interesting trade-off that the CPU configuration fails to exploit within 60 min

This answers RQ3 as follows: a faster execution of optimisation algorithms enables the tester not only to use the algorithms but also to exploit their results more effectively. This real-world smoke test example from IBM demonstrates that scale-ups accrued from the use of GPGPU are not only sources of efficiency improvement, they can also make possible test activities that are simply impossible without this scalability.

The ability to execute a sophisticated optimisation algorithm within a relatively short time also allows the tester to consider state-of-the-art regression testing techniques with greater flexibility. The greater flexibility is obtained because the cost of the optimisation does not have to be amortised across multiple iterations. Many state-of-the-art regression testing techniques require the use of continuously changing sets of testing data, such as recent fault history (Yoo and Harman 2007) or the last time a specific test case has been executed (Kim and Porter 2002; Engström et al. 2010). In addition to the use of dynamic testing data, the previous work also showed that repeatedly using the same subset of a large test suite may impair the fault detection capability of the regression testing (Yoo et al. 2009).

7 Threats to Validity

Threats to internal validity concern the factors that could have affected the experiments in the paper. While GPGPU architecture has been researched for some time, the commercially available GPGPU frameworks such as CUDA and OpenCL are still in their early stages and, therefore, may contain faults in the implementation.

The GPGPU matrix-multiplication routine has been manually tested and validated with additional data apart from the test suites chosen for the empirical study. Regarding the precision of the GPGPU-based calculation, the aim of the paper is to investigate the potential speed-up that can be gained by using GPGPU, rather than to evaluate the effectiveness of the actual test suite minimisation in the context of regression testing. Therefore, precision does not constitute a major threat to the aim of this study.

Threats to external validity concern any factor that might prevent the generalisation of the results presented in the paper. Since the performance of GPGPU computing is inherently hardware specific, the results reported in the paper may not be reproducible in their exact form with other combinations of hardware components. However, given the advances in graphics card architecture, experiments applying the same approach on newer graphics cards are likely only to improve the speed-up results reported in the paper.

It should also be noted that test suite minimisation by evolutionary computation is an inherently ideal candidate for GPGPU computation, as the reformulated problem, matrix multiplication, is highly parallel in nature. Other problems in search-based software engineering may not lend themselves as readily to such a reformulation as the test suite minimisation problem does. However, this issue is inherent in any attempt to parallelise a software engineering technique and is not specific to the GPGPU approach.

Threats to construct validity arise when the measurements used in the experiments do not capture the concepts they are supposed to represent. The speed-up calculation was based on measurements of execution time for both algorithms using the system clock, which was chosen because it represents the speed of a technique as experienced by the end-user. Regarding the measurements of problem size used for the regression analysis, there may exist more sophisticated measurements of test suites and program source code that correlate better with the speed-up. However, the number of test requirements and the number of test cases are probably the most readily available measurements of source code and test suites, and they provide a reasonable starting point for this type of analysis.

8 Related Work

Recent developments in graphics hardware provide an affordable means of parallelism: not only is the hardware more affordable than multiple PCs, but the management cost is also much smaller than that required for a cluster of PCs, because only a single hardware component is involved. GPGPU has been successfully applied to various scientific computations (Boyer et al. 2009; Govindaraju et al. 2006), and Langdon and Banzhaf (2008) recently used GPGPU for performance improvements in optimisation problems. However, these techniques had not been applied to Search Based Software Engineering problems, motivating the work of this paper, i.e., the use of GPGPU to achieve performance improvements in SBSE.

As a first instance of GPGPU SBSE, we study the SBSE application domain of regression testing. Regression test selection (also known as test suite minimisation) aims to reduce the number of tests to be executed by calculating the minimum set of tests that are required to satisfy the given test requirements. The problem has been formulated as the minimal hitting set problem (Harrold et al. 1993), which is NP-hard (Garey and Johnson 1979).

Various heuristics for the minimal representative set problem, or the minimal set cover problem (the dual of the former), have been suggested for test suite minimisation (Chen and Lau 1996; Offutt et al. 1995). However, empirical evaluations of these techniques have reported conflicting views on the impact on fault detection capability: some reported no impact (Wong et al. 1998, 1999) while others reported compromised fault detection capability (Rothermel et al. 2002a, b).

One potential reason why test suite minimisation has a negative impact on the fault detection capability is the fact that the criterion for minimisation is structural coverage; achieving coverage alone may not be sufficient for revealing faults. This paper uses the multi-objective approach based on Multi-Objective Evolutionary Algorithm (MOEA) introduced by Yoo and Harman (2007); the paper also presents the first attempt to parallelise test suite minimisation with sophisticated criteria for scalability. Multi-objective forms of regression testing problems are increasingly popular in SBSE work, since most real-world regression testing scenarios need to satisfy multiple objectives (Harman 2011). Our SBSE approach to regression testing has also been used at Google (Yoo et al. 2011b).

Population-based evolutionary algorithms are ideal candidates for GPGPU parallelisation (Owens et al. 2007) and existing work has shown successful implementations for classical problems. Tsutsui and Fujimoto implemented a single-objective parallel Genetic Algorithm (GA) using GPU for the Quadratic Assignment Problem (QAP) (Tsutsui and Fujimoto 2009). Wilson and Banzhaf implemented a linear Genetic Programming (GP) algorithm on the Xbox 360 game console (Wilson and Banzhaf 2009). Langdon and Banzhaf implemented GP for GPU using a SIMD interpreter for fitness evaluation (Langdon and Banzhaf 2008). Wong implemented an MOEA on GPU and evaluated the implementation using a suite of benchmark problems (Wong 2009). Wong’s implementation parallelised not only the fitness evaluation step but also the parent selection, the crossover and mutation operators, and the dominance checking.

Despite the highly parallelisable nature of many techniques used in SBSE, few parallel algorithms have been used. Of 763 papers on SBSE (Zhang 2011), only three present results for parallel execution of SBSE. Mitchell et al. used a distributed architecture for their clustering tool Bunch (Mitchell et al. 2001). Mahdavi et al. (2003) used a cluster of standard PCs to implement a parallel hill climbing algorithm. Asadi et al. used a distributed client-server architecture for the concept location problem (Asadi et al. 2010). All three of these previous approaches use a distributed architecture that requires multiple machines. The present paper is the first work on SBSE that presents results for the highly affordable parallelism based on GPGPU.

There is existing work that re-implements meta-heuristic algorithms on GPGPU for non-SBSE applications. This previous work ports the meta-heuristic algorithms in their entirety (Wong 2009) to GPGPU, whereas the present paper concerns the practical improvement in scalability of SBSE, rather than the technical feasibility of GPGPU-based meta-heuristic implementation. The present paper thus re-implements only the most parallelisable module in Multi-Objective Evolutionary Algorithms (MOEAs) and performs an extensive empirical evaluation of the impact on scalability.

We believe that this approach may also prove to be applicable to many other SBSE problems. To achieve this, it will be necessary to develop new ways to port the fitness computation to the GPGPU device. For some applications, such as test data generation and re-generation problems (Ali et al. 2010; Yoo and Harman 2012b), this may not be possible, because the fitness function requires execution of the program under test; we cannot be sure that GPGPU devices will ever develop to the point that execution of arbitrary code will become possible.

However, for other SBSE applications, such as requirements optimisation (Saliu and Ruhe 2007; Zhang et al. 2011), prediction system feature selection (Kirsopp et al. 2002) and requirements sensitivity analysis (Harman et al. 2009), fitness computation remains an oft-repeated and, thereby, SIMD-friendly requirement. Also, in these application areas, the underlying Software Engineering problem may be characterised using a table (albeit a very large one). In such cases, where the SBSE problem can be formulated in terms of the optimisation of choices, based on a spreadsheet of known values, this may prove to port well onto the SIMD architecture offered by GPGPU devices.

Furthermore, for requirements optimisation problems there is known to be a close similarity between the problem representation for search based requirements and search based regression testing (Harman et al. 2012). As a result, the techniques used here may also prove to be readily applicable to these problems with little need for significant modification of our approach.

9 Conclusion

This paper presented the first results on GPGPU SBSE: the use of GPGPU-based massive parallelism to improve the scalability of regression testing, based on Search-Based Software Engineering (SBSE). The advances in GPGPU architecture and the consequent availability of parallelism provide an ideal platform for improving SBSE scalability through SIMD parallelism.

The paper presents an evaluation of GPGPU-based test suite minimisation for real-world examples that include an industry-scale test suite. The approach was evaluated on three popular multi-objective evolutionary algorithms. The results show that GPGPU-based optimisation can achieve a speed-up of up to 25.09× compared to a single-threaded version of the same algorithm executed on a CPU. The highest speed-up achieved by the CPU-based parallel optimisation was 9.29×. Statistical analysis shows that the speed-up correlates with the logarithms of the problem size factors, i.e. the size of the program under test and the size of the test suite. This finding indicates that, as the problem becomes larger, the scalability of the proposed approach increases: a very attractive finding. Future work will include an empirical study of a wider range of test suites, as well as seeking insights into why the MOEAs benefit differently from parallelisation.