1 Introduction

The software industry nowadays relies heavily on testing to improve the quality of its products. There are, of course, good reasons for adopting this practice. First, as opposed to more heavy-weight techniques such as static analysis, testing is easy to deploy and understand, and most developers are familiar with software testing processes and tools. Second, testing is scalable (i.e., millions of tests can be executed within hours even on large programs) and precise (i.e., it does not generate false alarms that impede developers’ productivity). Third, while testing cannot prove the absence of bugs, there is ample evidence that testing does find important bugs that are then fixed by developers. Despite these advantages, testing is not a silver bullet: crafting good tests is a time-consuming and costly process, and even then achieving high coverage and catching all defects can be challenging. For example, the tester-to-developer ratio at Microsoft is around 1-to-1, and yet important defects still escape into production. Naturally, there has been a great deal of research on alleviating these problems by developing techniques that aim to improve the automation and effectiveness (in terms of achieved coverage and defects found) of software testing.

Fig. 1. Class with an assertion in one method (left). Input x is not properly sanitized in method setX; consequently, the assertion can be violated by a combination of a method call sequence and specific input values (right).

Random testing is the most basic and straightforward approach to automating software testing. Typically, it generates and executes millions of test cases within hours in a completely automatic fashion, and quickly covers many statements (or branches) of a software under test (SUT). However, a drawback of random testing is that, depending on the characteristics of the SUT, the achieved coverage plateaus due to unlikely execution paths. Figure 1 gives our motivating example Java program that illustrates this point (left), together with a specific test case that triggers an assertion violation (top right). To apply random testing to the example, we generate the randomized unit test shown in the middle of the right half of the figure. Clearly, it is trivial to execute this simple unit test many times, each time with a new pair of random numbers being generated. Executing it, however, can never violate the assertion, no matter which input values are generated; we would additionally need to generate more complex sequences of method calls (as shown in the lower right of the figure). Exploring both dimensions (parameter values and method sequences) randomly tends to plateau and miss paths that require specific combinations of method sequences and parameter values.
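Since Fig. 1 is not reproduced here in source form, the following is a minimal sketch in its spirit; the class name Example, the constructor, and the magic value −42 are our own illustrative assumptions, with only setX and the assertion taken from the figure caption.

    // Sketch of a class in the spirit of Fig. 1 (names and the magic value are assumptions).
    public class Example {
        private int x;

        public Example(int x) {
            this.x = Math.abs(x); // the constructor sanitizes its input ...
        }

        public void setX(int x) {
            // ... but setX does not: the negative value -42 slips through the check
            if (x >= 0 || x == -42) {
                this.x = x;
            }
        }

        public void check() {
            // Violated only by the call sequence setX(-42); check();
            // random argument values alone, without that sequence, cannot trigger it.
            assert this.x >= 0 : "x must be non-negative";
        }
    }

In this sketch, a randomized unit test that only constructs the object and calls setX never evaluates the assertion, and even one that also calls check would have to stumble on the specific value −42; this is exactly the two-dimensional search space of call sequences and parameter values discussed above.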

A more heavyweight approach could be based on symbolic execution, which leverages automatic constraint solvers to compute test inputs that cover such hard-to-cover branches. For example, when run on the method from our example, the JDart [26] dynamic symbolic execution tool generates test cases covering all of its branches in less than a second, thereby also triggering the assertion violation. The authors also show that JDart improves coverage over random testing for a class of numerically-intensive SUTs. However, symbolic-testing-based methods mainly excel at automatically generating test inputs over primitive numeric data types, and have hence been successfully applied as either system-level (e.g., SAGE [18], KLEE [6]) or method-level (e.g., JDart [26], JCute [35]) test generators.

Generalizing from the above example, generating unit tests for object-oriented software poses a two-dimensional challenge: instead of taking just primitive types as input, methods in object-oriented software require a rich heap structure of class objects to be generated. While several approaches have been proposed that automatically generate symbolic heap structures [25], logical encoding of such structures results in more complex constraints that put an additional burden on constraint solvers; hence, these approaches have not yet seen wider adoption on large SUTs. On the other hand, generating heap structures by randomly creating sequences of constructor+method invocations was shown to be effective, in particular when advanced search- and feedback-directed algorithms are employed (e.g., Randoop  [29], EvoSuite [13]). It is then natural to attempt to integrate the two approaches by using random testing to perform global/macro exploration (by generating heap structures using sequences of constructor+method invocations at the level of classes) and dynamic symbolic execution to perform local/micro exploration (by generating inputs of primitive types using constraint solvers at the level of methods). In this paper, we describe, implement, and empirically evaluate such a hybrid approach.

Our hybrid approach integrates feedback-directed unit testing with dynamic symbolic execution. We leverage feedback-directed unit testing to generate constructor+method sequences that create heap structures and drive a SUT into interesting global (i.e., macro) states. We feed the generated sequences to a dynamic symbolic execution engine to compute inputs of primitive types that drive the SUT into interesting local (i.e., micro) states. We implemented this approach as a tool named JDoop,Footnote 1 which integrates the feedback-directed unit testing tool Randoop [29] with the state-of-the-art dynamic symbolic execution engine JDart [26]. Given that such an integration has not been thoroughly empirically studied in the past, we also assess the merits of this approach through a large-scale empirical evaluation.

Our main contributions are as follows:

  • We developed JDoop, a hybrid tool that integrates feedback-directed unit testing with dynamic symbolic execution to be able to experiment with large-scale automatic testing of object-oriented software.

  • We implemented a distributed benchmarking infrastructure for running experiments in isolation on a cluster of machines; this allows us to execute large-scale experiments that ensure statistical significance, and also advances the reproducibility of our results.

  • We performed an extensive empirical evaluation and comparison between random (our baseline) and hybrid testing approaches in the context of automatic testing of object-oriented software.

  • We identified several open research questions during our evaluation, performed additional targeted experiments to obtain answers to these questions, and provided guidelines for future research efforts in this area.

2 Background

We provide background on dynamic symbolic execution and feedback-directed random testing.

2.1 Dynamic Symbolic Execution

Dynamic symbolic execution [6, 17, 36] is a program analysis technique that executes a program with concrete and symbolic inputs at the same time. It systematically collects constraints over the symbolic program inputs as it is exploring program paths, thereby representing program behaviors as algebraic expressions over symbolic values. The program effects can thus be expressed as a function of such expressions.

Dynamic symbolic execution maintains—in addition to the concrete state defined by the concrete program semantics—the symbolic state, which is a tuple containing symbolic values of program variables, a path condition, and a program counter. A path condition is a conjunction of symbolic expressions over the symbolic inputs that characterizes an execution path through the program. It is generated by accumulating (symbolic) conditions encountered along the execution path, so that concrete data values that satisfy it can be used to drive its concrete execution. Path conditions are stored as a symbolic execution tree that characterizes all the paths exercised as part of the symbolic analysis.

In dynamic symbolic execution, the symbolic execution tree is built by repeatedly augmenting it with new paths that are obtained from unexplored branches in the tree. This is done by employing an exploration strategy such as depth-first, breadth-first, or random. A constraint solver is used to obtain a valuation for a yet-unexplored branch by feeding it the corresponding path condition. The new valuation drives a new iteration of dynamic symbolic execution that augments the symbolic execution tree with a new path. JDart is a dynamic symbolic execution engine that uses the Java PathFinder framework [23, 44] for executing Java programs and recording path conditions. Maintaining the symbolic state is achieved by a customized implementation of the bytecode instructions in the JVM of Java PathFinder that performs concrete and symbolic operations simultaneously. In JDoop, we configure JDart to use the Z3 [9] constraint solver for finding concrete inputs that drive execution along previously unexplored symbolic paths.
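As a concrete (purely illustrative) example of these notions, consider the following small method with a single symbolic integer input; the comments show the path condition that dynamic symbolic execution records for each path.

    // Illustrative only; not taken from the paper or from JDart.
    class PathConditionExample {
        static int classify(int x) {   // x is treated as the symbolic input
            if (x > 10) {              // symbolic condition: x > 10
                if (x % 2 == 0) {      // symbolic condition: x % 2 == 0
                    return 2;          // path condition: (x > 10) && (x % 2 == 0)
                }
                return 1;              // path condition: (x > 10) && !(x % 2 == 0)
            }
            return 0;                  // path condition: !(x > 10)
        }
    }

Starting from any concrete run, say x = 0, the engine negates an unexplored condition along the recorded path, asks the solver (Z3 in our setup) for a satisfying valuation such as x = 11 or x = 12, and uses it to drive the next concrete execution, thereby growing the symbolic execution tree.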

A limitation of this approach is that native code is outside the scope of the analysis. Based on the Nhandler extension [38] to Java PathFinder, JDart offers two strategies for dealing with native code.

  • Concrete Native. In this mode, JDart executes native code on concrete data values, and no symbolic execution of native parts is performed: only concrete values are passed to and from native calls, and symbolic values are not propagated through the call. Instead, the return value is annotated with a fresh symbolic variable. As a consequence, the concrete side of an execution is faithful to the respective execution on a normal JVM. However, branches in the native code are not recorded in symbolic path conditions, which can also prevent JDart from exploring branches that follow a native call. Another downside of this mode is that its implementation in Java PathFinder is relatively slow.

  • No Native. In this mode, JDart does not execute native code at all. Instead, it returns a default concrete value every time a native method is called and a return value is expected. The concrete value is annotated with the corresponding symbolic variable, using the method signature of the native method as the name of that variable. Concrete execution, in this case, is not faithful to the respective execution on a normal JVM as the introduced default values in most cases are not equal to the values that would be returned by the actual method invocations (and side effects are ignored as well). Recorded symbolic branches cannot be explored even if solutions are found by a constraint solver as there currently is no mechanism that allows feeding these values into the execution (instead of the default return values of native methods).

Since the ‘No Native’ mode is more performant and since currently there is no way of solving most of the recorded constraints in ‘Concrete Native’ mode (cf. results in Sect. 4), JDoop runs JDart in ‘No Native’ mode for native code. We use the ‘Concrete Native’ mode in our evaluation for analyzing the potential limiting impact of not executing native code faithfully and not being able to find and inject values that target branches in native code.

JDart produces the following outputs: a symbolic execution tree that contains all explored paths along with performance statistics, vectors of concrete input values that execute paths in the tree, and a suite of test cases (based on these vectors). A symbolic execution tree contains leaf nodes for all explored paths and additionally leaves for branches off of executed paths that could not be explored because the constraint solver was not able to produce adequate concrete values or because native code is not executed (in fully symbolic mode). For these leaves JDart does not generate input vectors or test cases.

2.2 Feedback-Directed Random Testing

A simple approach to automatic unit testing of object-oriented software is to generate sequences of constructor+method invocations, together with the respective concrete input values, completely at random. However, this typically results in a large overhead since numerous sequences get generated with invalid prefixes that violate common implicit class or method requirements (e.g., passing a null reference to a method that expects an allocated object). Moreover, such sequences cause trivial, uninteresting exceptions to be thrown early, thereby preventing deep exploration of the SUT state space. Hence, instead of generating unit tests blindly and in a completely random fashion, useful feedback can be gathered from previous test executions to direct the creation of new unit tests. In this way, unit tests that execute long sequences of method calls to completion (i.e., without exceptions being thrown) can be generated. This approach is known as feedback-directed random testing and is implemented in the Randoop automatic unit testing tool [29].

Randoop uses information from previous test executions to direct further unit test generation. The tool maintains two sets of constructor+method invocation sequences: those that do not violate a property (i.e., property-preserving) and those that do (i.e., property-violating). The property-violating set is initially empty, while the property-preserving set is initialized with an empty sequence. The default property that is maintained is unit test termination without any errors or exceptions being thrown. Randoop randomly selects a public method (or a constructor) and an existing sequence from the property-preserving set. It then appends the invocation of the selected constructor/method to the end of the sequence, and replaces primitive-type arguments with concrete values that are randomly selected from a preset pool of values. Next, the newly generated sequence is compared against all previously generated sequences in the two sets. If it already exists, it is simply dropped and the random selection is repeated. Otherwise, Randoop executes the new sequence and checks for property violations. If no properties are violated, the sequence is added to the property-preserving set, and otherwise to the property-violating set. Randoop keeps extending property-preserving sequences until it reaches the provided time limit.
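The following is a minimal, self-contained sketch of this feedback-directed loop; it models sequences as lists of call descriptions and abstracts execution as a caller-supplied predicate, so all names and types are ours and do not correspond to Randoop's actual API.

    import java.util.*;
    import java.util.function.Predicate;

    // Simplified sketch of feedback-directed sequence generation (not Randoop's API).
    final class FeedbackDirectedSketch {
        static List<List<String>> generate(List<String> methods,
                                           List<Integer> valuePool,
                                           Predicate<List<String>> executesCleanly,
                                           int budget) {
            List<List<String>> preserving = new ArrayList<>();
            List<List<String>> violating = new ArrayList<>();
            preserving.add(new ArrayList<>());                 // start from the empty sequence
            Random rnd = new Random();
            for (int i = 0; i < budget; i++) {
                List<String> base = preserving.get(rnd.nextInt(preserving.size()));
                String method = methods.get(rnd.nextInt(methods.size()));
                int arg = valuePool.get(rnd.nextInt(valuePool.size()));
                List<String> extended = new ArrayList<>(base);
                extended.add(method + "(" + arg + ")");        // append call with a random argument
                if (preserving.contains(extended) || violating.contains(extended)) {
                    continue;                                  // duplicate: drop and retry
                }
                if (executesCleanly.test(extended)) {
                    preserving.add(extended);                  // feedback: reuse as a prefix later
                } else {
                    violating.add(extended);                   // keep as a potential bug-revealing test
                }
            }
            return preserving;
        }
    }

The key design choice is that only property-preserving sequences are ever extended, which biases generation toward long, exception-free call prefixes; these are exactly the sequences that JDoop later reuses as driver programs (Sect. 3).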

3 Hybrid Approach

In this section, we describe our hybrid approach that integrates dynamic symbolic execution and feedback-directed random testing into an algorithm for automatic testing of object-oriented software. We implemented this algorithm as the JDoop tool, which is freely available.Footnote 2 Figure 2 shows the flow of the algorithm, which is iterative; each iteration consists of several stages that we describe next.

3.1 Generation of Sequences

The first stage of every iteration of our algorithm is feedback-directed random testing using Randoop, which generates constructor+method sequences as described in Sect. 2.2. Randoop takes advantage of a pool of concrete primitive values to be used as constructor/method arguments when generating sequences. In the first iteration, we use the default pool containing a few values, which for the integer type are −1, 0, 1, 10, and 100. Hence, an instance of a generated sequence for our running example from Fig. 1 is the one shown in the middle of the right half of the figure. Our algorithm grows the pool for subsequent iterations with concrete inputs generated by dynamic symbolic execution, as we describe later. The sequences generated in this stage serve two purposes. First, we employ them as standalone unit tests that exercise the SUT, which is their original intended purpose. Second, our hybrid algorithm also employs them as driver programs to be used in the subsequent dynamic symbolic execution stage.

Fig. 2. Iterative algorithm of JDoop for unit test generation. The algorithm integrates dynamic symbolic execution and feedback-directed random testing.

3.2 Selection and Transformation of Sequences

The previous stage typically generates far too many sequences to be successfully explored with a dynamic symbolic execution engine in a reasonable amount of time. For example, several thousand valid sequences are often generated in just a few seconds. Hence, it is prudent to select a promising subset of the generated sequences to be transformed into inputs for the subsequent dynamic symbolic execution with JDart. The second stage implements this selection and transformation of constructor+method sequences.

Note that dynamic symbolic execution techniques have limitations, which is why we implemented the hybrid approach in the first place. In particular, they can typically treat symbolically only method arguments of primitive types. For example, if a sequence contains method calls with non-primitive types only, JDart will not be able to explore any additional paths. Hence, not every generated sequence is suitable for dynamic symbolic execution with JDart, and as the first step of this stage, we filter out all sequences with no arguments of a primitive type. Next, we have two strategies (i.e., heuristics) for selecting promising sequences. The first strategy randomly selects a subset of sequences. The second strategy prioritizes candidate sequences with more symbolic variables, which is based on the intuition that having more symbolic variables leads to more paths (and also branches and instructions) being covered. We compare the two strategies in our empirical evaluation. Once promising sequences are selected, they have to be appropriately transformed into driver programs for JDart.
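As a small illustration of the two strategies, the following sketch (with our own names, not JDoop's internals) selects k candidate sequences either uniformly at random or by prioritizing those with the most symbolic-candidate arguments.

    import java.util.*;

    // Sketch of the two selection heuristics; the sequence type S is left abstract.
    final class SequenceSelection {
        static <S> List<S> selectRandom(List<S> candidates, int k, Random rnd) {
            List<S> copy = new ArrayList<>(candidates);
            Collections.shuffle(copy, rnd);                       // strategy R: uniform random choice
            return copy.subList(0, Math.min(k, copy.size()));
        }

        static <S> List<S> selectPrioritized(List<S> candidates, int k,
                                             java.util.function.ToIntFunction<S> symbolicVarCount) {
            List<S> copy = new ArrayList<>(candidates);
            copy.sort(Comparator.comparingInt(symbolicVarCount).reversed()); // strategy P: most symbolic vars first
            return copy.subList(0, Math.min(k, copy.size()));
        }
    }
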

Every candidate sequence is transformed for the final stage that performs dynamic symbolic execution. We achieve this by turning all constructor and method arguments of primitive types, which are supported by JDart, into symbolic input values. In our implementation, this is a simple source-to-source transformation. For instance, our example sequence results in the following driver program:

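The listing below is a hedged reconstruction of such a driver for the class sketched in Sect. 1; the driver class name and the name of the test method are assumptions (only s1 and s2 appear in the surrounding description), and in JDart the actual marking of symbolic inputs happens in the tool's configuration rather than in the Java source.

    // Hypothetical driver: constructor/method arguments of primitive type become
    // parameters of the test method, which JDart is configured to treat symbolically.
    public class ExampleDriver {
        public static void test(int x, int s1, int s2) {
            Example e = new Example(x); // symbolic constructor argument
            e.setX(s1);                 // symbolic method arguments
            e.setX(s2);
            e.check();                  // the assertion may fire for specific values
        }

        public static void main(String[] args) {
            // Entry point added for dynamic symbolic execution; the concrete seed
            // values are arbitrary and are replaced by JDart during exploration.
            test(0, 0, 0);
        }
    }
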

In the driver, the integer inputs to the constructor and methods are turned into arguments of the test method. The test method is called from a main method that is added as an entry point for dynamic symbolic execution. Finally, JDart is instructed to treat the s1 and s2 inputs, together with the constructor argument, symbolically.

3.3 Dynamic Symbolic Execution of Sequences

The last stage of every iteration is exploring the generated driver programs using dynamic symbolic execution as implemented in JDart. JDart explores paths through each driver program by solving path constraints over the specified symbolic inputs as described in Sect. 2.1. In the process, it generates additional unit tests, where each unit test corresponds to an explored path. The generated unit tests are added to the final set of unit tests. In addition to generating these unit tests, we also collect all the concrete input values that JDart generates in the process. We add these values back into Randoop’s pool of concrete primitive values for the sequence generation stage of the next iteration. By doing this, we feed the information generated by dynamic symbolic execution back into the feedback-directed random testing stage.
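Putting the three stages together, the overall loop can be sketched as follows; the interfaces and all names here are ours for illustration, and the real JDoop orchestrates Randoop and JDart as external tools under the per-stage time limits described in Sect. 4 (the sketch also ignores how a stage's budget is split across drivers).

    import java.util.*;

    // High-level sketch of one JDoop-style iteration (illustrative, not JDoop's code).
    final class HybridIterationSketch {
        interface SequenceGenerator {        // stands in for Randoop
            List<String> generate(Set<Integer> valuePool, long secondsBudget);
        }
        interface SymbolicExplorer {         // stands in for JDart on one driver
            List<Integer> explore(String driver, long secondsBudget);
        }

        static List<String> run(SequenceGenerator randomStage, SymbolicExplorer dseStage,
                                long randomSecs, long dseSecs, int iterations) {
            Set<Integer> pool = new LinkedHashSet<>(List.of(-1, 0, 1, 10, 100)); // default pool
            List<String> testSuite = new ArrayList<>();
            for (int i = 0; i < iterations; i++) {
                // Stage 1: feedback-directed random testing (global/macro exploration).
                List<String> sequences = randomStage.generate(pool, randomSecs);
                testSuite.addAll(sequences);
                // Stage 2: selection and source-to-source transformation into drivers
                // (elided here; see the sketches above).
                for (String driver : sequences) {
                    // Stage 3: dynamic symbolic execution (local/micro exploration);
                    // concrete inputs found by the solver are fed back into the pool.
                    pool.addAll(dseStage.explore(driver, dseSecs));
                }
            }
            return testSuite;
        }
    }
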

4 Empirical Evaluation

We aim to answer the following research questions using the results of our empirical evaluation.

  1. Can JDoop cover paths that plain random test case generation does not, and how big is the positive impact of covering such paths? To answer this question, we compare the performance of Randoop (as our baseline) and JDoop, using code coverage as a metric for the quality of the generated test suites.

  2. Can dynamic symbolic execution enable randomized test case generation to access regions of a SUT that remain untested otherwise, i.e., does the feedback loop from JDart to Randoop (see Fig. 2) have a measurable impact on achieved coverage? To answer this question, we run JDoop in multiple configurations with varying amounts of runtime attributed to Randoop and JDart, enabling a feedback loop in some configurations and preventing it in others.

  3. What are the constituting factors impacting the effectiveness of JDoop in terms of the code coverage that can be achieved through automated generation of test suites? More specifically, can we confirm or refute the conjecture from related work [14] that robustness of the used dynamic symbolic execution engine is pivotal, or do other factors exist that have an impact on the achievable coverage (e.g., selection of test cases for symbolic execution)? To answer this question, we analyze statistics produced by JDart and vary the strategy in JDoop for selecting method sequences for execution with JDart as discussed in Sect. 3 (either selecting sequences randomly or prioritizing those with many symbolic variables).

In the remainder of this section, we introduce the benchmarks we used in our evaluation, describe our experimental setup, and present and discuss the results of the evaluation.

Table 1. SF110 Benchmarks we use in the evaluation. Column #B is the number of branches, #I instructions, #M methods, and #C classes.

4.1 Benchmarks

We performed our empirical evaluation using the SF110 benchmark suite [37]. The suite consists of 110 Java projects that were randomly selected from the SourceForge repository of free software to reduce the threat to external validity (see Sect. 5). In our evaluation, we chose the largest subset of SF110 that both JDoop and Randoop can successfully execute on. Benchmarks that were excluded can be grouped into the following categories: unsuitable environment, inadequate or empty benchmarks, and deficiencies of testing tools. In the unsuitable environment category, benchmarks require privileged permissions in the operating system, a properly set configuration file, or a graphical subsystem to be available. There are several empty benchmarks, benchmarks that call methods that are not trapped by the testing tools, and benchmarks that are otherwise inadequate because of conflicting dependencies with our testing infrastructure. Finally, for some benchmarks Randoop generates test cases that do not compile. All such problematic benchmarks were excluded from consideration, which left us with 41 benchmarks in total, as listed in Table 1. For each benchmark we list the number of branches, instructions, methods, and classes, which demonstrates that we use a wide range of SUTs in terms of their size and complexity.

4.2 Experimental Setup

We used two tools in our empirical evaluation: JDoop and Randoop (version 3.0.10). We explored several configurations of JDoop, where each configuration is determined by three parameters. The first parameter is the time limit for the first stage of every iteration, which is when Randoop runs (see Sect. 2.2); we vary this parameter as 1, 9, and 20 min. The second parameter is the time limit for the second and third stages combined, which is when JDart runs; we vary this parameter as 1, 9, and 40 min. The third parameter determines the strategy for selecting constructor+method call sequences as candidates for dynamic symbolic execution between: (1) random selection (denoted by R), and (2) prioritization based on the number of symbolic variables (denoted by P). Each configuration is code-named as JD-O-J-S, where O is the time limit for Randoop, J is the time limit for JDart, and S is the sequence selection strategy used. We explored the following six JDoop configurations: JD-1-9-P, JD-1-9-R, JD-9-1-P, JD-9-1-R, JD-20-40-P, and JD-20-40-R.

We carried out the evaluation in the Emulab testbed infrastructure [45]. We used 20 identical machines, each of which was equipped with two 2.4 GHz 64-bit 8-core processors, 64 GB of DDR4 RAM, and an SSD disk; the machines were running Ubuntu 16.04. We developed our testing infrastructure around the Apache Spark cluster computing framework. To facilitate reproducibility, each execution of a testing tool on a benchmark is performed in a pristine sandboxed virtualization environment. This is achieved via LXC containers running a reproducible build of Debian GNU/Linux code-named Stretch. We allocated 4 dedicated CPU cores and 8 GB of RAM to each container. Both Randoop and JDoop are multi-threaded, and hence they utilized the multiple available CPU cores. Our testing infrastructure is freely available for others to use and extend.Footnote 3

We allocate a one-hour time limit per benchmark per testing tool/configuration for test case generation. The subsequent test case compilation and code coverage measurement phases are not counted toward this limit. Given that both Randoop and JDoop employ randomized heuristics, we repeat each run 5 times to account for this variability, and for each benchmark we compute an average and a standard deviation. In terms of code coverage metrics, we measured instruction and branch coverage at the Java bytecode level using JaCoCo [20]. Furthermore, to get more insight into the performance of JDart, we collect statistics on the number of successful and failed runs, additional test cases it generates, symbolic variables in driver programs, times a constraint solver could not find a valuation for a path condition, and JDart runs that explored one path versus multiple paths.

Table 2. Branch coverage (including standard deviations) averaged across 5 runs. The highest and lowest numbers per benchmark are given in bold and italic, respectively.

4.3 Evaluation of Test Coverage

Table 2 gives branch coverage results for each tool and configuration on all of the benchmarks. Most results are stable across multiple runs, meaning that the calculated standard deviations are very small. In particular, the standard deviations for Randoop on a vast majority of benchmarks are 0, even though we used a different random seed for every run. This suggests that Randoop reaches saturation and is unable to cover more branches. For the most part, there are only small differences in the achieved coverage between different tools/configurations when looking at the total number of covered branches. However, JDoop (in one of its configurations) consistently achieves higher coverage than Randoop. Given that pure Randoop saturates, we can conclude that the improvements in coverage we observe with JDoop are due to leveraging dynamic symbolic execution. Among the JDoop configurations, the best-performing ones are the two 9-1 configurations, where in each iteration Randoop runs for 9 min and JDart for 1 min; there are 6 such iterations within the one-hour time limit.

Fig. 3. Increases in branch coverage per benchmark by JDoop over the baseline of Randoop (in % of coverage achieved by the baseline).

Figure 3 shows the increase in branch coverage per benchmark over pure Randoop that is achieved by some configuration of JDoop. The increase is measured as a percent increase in the number of branches covered by JDoop over Randoop. Standard deviation is omitted in this graph as it was small in most instances (cf. Table 2). On two benchmarks JDoop performs slightly worse than pure Randoop. On roughly half of the benchmarks no change is observed, in most cases with no variance. This suggests that these benchmarks are simply not amenable to increasing coverage through symbolic execution. On the remaining half of the benchmarks, branch coverage is increased. Relative to the baseline of pure Randoop, the achieved coverage ranges from 101.1% to 143.8% of the baseline, with an average of 109.6% across this half of the benchmarks.

Table 3. Statistics produced by JDart for single runs of all benchmarks in different configurations of JDoop. JDart uses Nhandler in the ‘No Native’ mode, except for one experiment that we performed in the ‘Concrete Native’ (CN) mode.

4.4 Profiling Dynamic Symbolic Execution

To analyze the potential impact of the robustness of dynamic symbolic execution on the validity of our results, we collected data from runs on all benchmarks for all configurations. We perform this analysis on data from single runs of JDoop as the other results show very little variation of results between different runs in most cases. Table 3 reports statistics on the JDart operation in different series of experiments. Data in the table is explained and discussed in the following paragraphs.

Modes of Operation. For all of the analyzed configurations of JDoop, JDart runs successfully in the vast majority of cases and produces significant numbers of test cases (up to 16,588 in total for all benchmarks in one experiment). Most additional test cases are produced in the JD-1-9 configurations that enable the feedback loop between Randoop and JDart but grant the bulk of runtime to JDart. Across all configurations, random selection of method sequences for JDart leads to generating additional test cases for more benchmarks than prioritizing sequences with many symbolic variables. Prioritization, on the other hand, leads to more additional test cases in total.

Robustness and Scalability. Our data indicates that JDart is robust. Only a small number of runs fail (between 0.0% and 2.5%). Of these failures, only a tiny fraction is due to unhandled native code (less than 1%).Footnote 4 The vast majority of failed runs is caused by class-path issues in the benchmarks (more than 99%). There are only very few cases in which the constraint solver was not able to solve constraints of all paths in symbolic execution trees (between 0.0% and 1.8%).

Using Nhandler in the ‘Concrete Native’ mode leads to native calls being executed faithfully and to longer recorded path conditions, as discussed in Sect. 2. This yields constraints that are marked as not solvable (‘don’t know’ or D/K for short) in 93.6% of all discovered paths in symbolic execution trees. This indicates the likelihood of JDart not being able to explore most of the paths that could be explored with proper symbolic treatment of native methods. Table 4 reports the number of occurrences for all encountered native methods in one run of JDoop. As can be seen from the data, the charAt method of the String class offers by far the greatest potential for improving on the number of explored paths. Note, however, that numbers in the table do not necessarily translate into the same number of additional paths as occurrences are counted along paths in trees and the same method call may appear on multiple paths.

Amenable Test Cases. The number of symbolic variables per test case behaves as expected: it increases when using prioritization of sequences with many variables. Prioritization, however, comes at a cost since there tends to be more runs of JDart in configurations that do not use prioritization. For all benchmarks, a high number of runs yields only one path and hence no additional test cases. A considerable number of these runs may be attributed to using Nhandler in the ‘No Native’ mode, thereby hiding branches by not executing native code. On the other hand, even in the experiment in which Nhandler was used in the ‘Concrete Native’ mode, two thirds of all runs explored only a single path. This indicates that many method sequences that were selected for JDart simply do not branch on symbolic variables.

Table 4. Symbolic Variables introduced by Nhandler in the ‘Concrete Native’ mode in a single run of JD-9-1.

4.5 Discussion

The obtained results allow us to provide answers to our research questions.

Question 1: Covering More Paths. JDoop consistently outperforms Randoop on roughly 50% of the benchmarks (see Table 2 and Fig. 3). Measured in the absolute number of branches, the margins are relatively slim in many cases. There are, however, cases in which the achieved branch coverage is increased by 28%, resulting in an increase in code coverage of 5.4 percentage points (26_jipa). On about 50% of the benchmarks no variation can be seen in coverage between the two approaches. Together with the little variance that is observed between different runs, this indicates that Randoop in many cases reaches a state where achievable coverage is (nearly) saturated. It makes sense that in such a scenario JDoop does not add many percentage points in code coverage; it merely adds coverage through those hard-to-hit corner cases.

Question 2: Reachable Regions. Our results indicate that the feedback loop has a positive impact. The JD-9-1 configurations perform better than other configurations in most cases. Regarding the time distribution between Randoop and JDart, the picture is not as clear. There is a lot more variation in the margins of coverage increase (or sometimes decrease) for the configuration that grants most of the time to JDart. In one particularly amenable case this results in a coverage increase of 43% (from 13.7% to 19.7% for 49_diebierse).

Question 3: Robustness of Symbolic Execution. Here, we have to refute the conjecture made in related work [14], namely that a robust dynamic symbolic execution engine can reap big increases in code coverage; at the very least, our results curb expectations about achievable coverage increases. Our experiments showed that JDart handles most benchmarks without many problems. Proper analysis of native code (especially for String methods) certainly has the potential to improve code coverage further, but the consistently high number of symbolic analyses that result in a single path (even in the control experiment) points to another important factor that contributes to the small margins: in most cases, the generated test cases simply do not allow exploring many new branches.

The experiments even indicate that it does not pay off to prioritize method sequences with many variables for JDart. Prioritization adds cost twice: once for analyzing test cases and then for exploring with many variables. Taking into account the observation from the first answer, that Randoop (almost) achieves saturation of coverage in one hour, this again indicates that in JDoop corner cases are discovered by JDart. Covering more search space beats investigating the few locations more intensively in such a scenario.

Remark on Achievable Coverage. Our observations correlate well with the observations made in [12], where the results of a static analysis of the SF110 benchmark suite are reported. The analysis revealed that only 6.6% of methods in the benchmark suite have path constraints that are exclusively composed of primitive type elements. On the other hand, the study identified objects in path constraints, calls to external libraries or native code, and exception-dependent paths as challenges to symbolic execution. The authors report that one third of methods have paths that deal with exceptions.

The low coverage (in absolute numbers) and low variance across all benchmarks for Randoop and JDoop in our experiments suggest that many branches simply cannot be covered by test cases that only rely on calling methods of objects from a project under test. Many branches rely on return values of calls to external libraries or the occurrence of exceptions, which are not triggered in a simple testing environment. Since there is no simple or automated approach for determining the achievable coverage for a benchmark, we sampled a few individual benchmarks and indeed quickly found cases where catch blocks in the code contained comments to the effect that the block is unreachable.

Taking into account the results from [12] and our findings, we conjecture that the branch coverage that is achieved by JDoop is close to the coverage that can be achieved without making the environment of a tested project symbolic.

5 Threats to Validity

Threats to External Validity. While the main purpose of the SF110 corpus of benchmarks is to reduce the threat to external validity since they were chosen randomly, we cannot be absolutely sure that the benchmarks we used are representative of Java programs. In addition, we excluded a number of problematic benchmarks from our evaluation (see Sect. 4.1). Hence, our results might not generalize to all programs. In JDoop we integrated Randoop and JDart, and we used Randoop as the baseline in our evaluation. We attempted to include another contemporary state-of-the-art Java testing tool into the comparison, and EvoSuite was an obvious choice to try. However, to the best of our ability we did not manage to get it to work with JaCoCo (the tool we use for measuring code coverage) on our benchmark suite despite exchanging numerous emails with the EvoSuite authors. This is a well-known problem caused by the online bytecode modifications that EvoSuite often performs.Footnote 5 While others successfully combined EvoSuite and JaCoCo in the past, that was accomplished only on very simple programs; in addition, others also reported differences in coverage results between EvoSuite’s internal measurements and JaCoCo.Footnote 6, Footnote 7 Hence, we could not perform a direct comparison and our results might not generalize to other tools. However, earlier work on EvoSuite reports similar results to ours with respect to using dynamic symbolic execution in combination with random testing [14]. Finally, note that we do not include the environment and dependencies of benchmarks into unit test generation, which might lead to sub-optimal coverage.

Threats to Internal Validity. In our evaluation, we experimented with 3 different time allocations for Randoop and JDart that we identified as representative. While our results show no major differences between these different time allocations, we did not fully explore this space and there might be a ratio that would lead to a different outcome. JDart currently cannot symbolically explore native calls, which might lead to not being able to cover program paths (and hence also branches and instructions) that depend on such calls. Our evaluation shows that this indeed happens and that native implementations of methods of the String class in Java are the main culprit, but it does not allow us to provide an estimate of the impact on the achieved code coverage. Finally, while we extensively tested JDoop to make sure it is reliable and performed sanity checks of our results, there is a chance for a bug to have crept in that would influence our results.

Threats to Construct Validity. Here, the main threat is the metrics we used to assess the quality of the generated test suites, and in particular branch coverage in the presence of dead code [3, 27]. This threat is reduced by previous work showing that branch coverage performs well as a criterion for comparing test suites [16].

6 Related Work

Symbolic Execution. Dynamic symbolic execution [17, 36] is a well-known technique implemented by many automatic testing tools (e.g., [6, 18, 35, 43]). For example, SAGE [18] is a white-box fuzzer based on dynamic symbolic execution. It has been routinely applied to large software systems, such as media players and image processors, where it has been successful in finding security bugs. Khurshid et al. [25] extend symbolic execution to support dynamically allocated structures, preconditions, and concurrency.

Several symbolic execution tools specifically target Java bytecode programs. A number of them implement dynamic symbolic execution via Java bytecode instrumentation. JCute  [35], the first dynamic symbolic execution engine for Java, uses Soot [39] for instrumentation and lp_solve for constraint solving. CATG  [41] uses ASM [2] for instrumentation and CVC4 [10] for constraint solving. Another dynamic symbolic execution engine, LCT  [24], supports distributed exploration; it uses Boolector and Yices for solving, but it does not have support for float and double primitive types. A drawback of instrumentation-based tools is that instrumentation at the time of class loading is confined to the SUT. For example, LCT does not by default instrument the standard Java libraries thus limiting symbolic execution only to the SUT classes. Hence, the instrumentation-based tools discussed above provide the possibility of using symbolic models for non-instrumented classes or using pre-instrumented core Java classes.

Several dynamic symbolic execution tools for Java are not based on instrumentation. For example, the dynamic symbolic white-box fuzzer jFuzz  [21] is based on Java PathFinder (as is JDart) and can thus explore core Java classes without any prerequisites. Symbolic PathFinder (SPF) [32] is a Java PathFinder extension similar to JDart. In fact, JDart reuses some of the core components of an older version of SPF, notably the solver interface and its implementations. While at its core SPF implements symbolic execution, it can also switch to concrete values in the spirit of dynamic symbolic execution [30]. That enables it to deal with limitations of constraint solvers (e.g., non-linear constraints).

Hybrid Approaches. There are several approaches similar to ours that combine fuzzing or a similar testing technique with dynamic symbolic execution. Garg et al. [15] propose a combination of feedback-directed random testing and dynamic symbolic execution for C and C++ programs. However, they address the challenges of a different target language and evaluate on a much smaller collection of benchmarks, which they simplified before evaluation. The Driller tool [40] interleaves fuzzing and dynamic symbolic execution for bug finding in program binaries, targeting single-file binaries in search of security bugs. Galeotti et al. [14] apply dynamic symbolic execution in the EvoSuite tool to explore test cases generated with a genetic algorithm. Even though their evaluation is carried out in a different way than the one presented in this paper, the general conclusion is the same in spirit: dynamic symbolic execution does not provide a lot of additional coverage on top of random-based test case generation for real-world object-oriented Java software. MACE [7] combines automata learning with dynamic symbolic execution to find security vulnerabilities in protocol implementations.

There are other automated hybrid software testing tools that do not strictly combine with symbolic execution (e.g., OCAT [22], Agitator [5], Evacon [19], Seeker [42], DSD-Crasher [8]). Because these tools either focus on a single method at a time or just form random method call sequences, they often fail to drive program execution to hard-to-reach sites in a SUT, which can result in suboptimal code coverage.

Random Testing. Randoop [29] is a feedback-directed random testing algorithm that forms random test cases that are sequences of method calls, while ensuring basic properties such as reflexivity, symmetry, and transitivity. Search-based software testing [28] approaches and tools are gaining traction, which is reflected in four annual search-based software testing tool competitions in recent years [33]. A prominent search-based tool is EvoSuite [13], which combines a genetic algorithm and dynamic symbolic execution. T3 [31] is a tool that generates randomized constructor and method call sequences based on an optimization function. JTExpert [34] keeps track of methods that can change the underlying object and constructs method sequences that are likely to get the object into a desired state. All the search-based testing tools are geared toward testing at the class level, while JDoop performs testing at the application/library level.

Benchmarking Infrastructures. In computer science, any extensive empirical evaluation, software competition, or reproducible research requires a significant software+hardware infrastructure. The Software Verification Competition’s BenchExec [4] is a software infrastructure for evaluating verification tools on programs containing properties to verify. It comes with an interface for verification tools to follow, which did not fit our needs: our coverage measurement outcomes cannot be judged in terms of program correctness. The Search-based Software Testing Competition [33] community created an infrastructure for the competition as well. However, just like tools that participate in the competition, their infrastructure is geared toward running a testing tool on just one class at a time. Emulab [45] and Apt [1] are testbeds that provide researchers with an accessible hardware and software infrastructure. They allow for repeatable and reproducible research, especially in the domain of computer systems, by providing an environment to specify the hardware to be used, on top of which users can install and configure a variety of systems.

7 Conclusions

We introduced a hybrid automatic testing approach for object-oriented software, described its implementation JDoop, and performed an extensive empirical exploration of this space. Our approach is an integration of feedback-directed random testing (Randoop) and dynamic symbolic execution (JDart), where random testing performs global exploration, while dynamic symbolic execution performs local exploration (around interesting global states) of a SUT. It is an iterative algorithm where these two exploration techniques are interleaved in multiple iterations. Our evaluation on real-world object-oriented software shows that dynamic symbolic execution provides consistent improvements in terms of code coverage on top of our baseline (pure feedback-directed random testing) on those examples that are amenable to this method of testing.