1 Introduction

In search-based software engineering and particularly search-based testing, popular heuristics (e.g.,[17]) with best-practice configurations in terms of operators and parameters (e.g.,[7]) are often used. As this out-of-the-box usage typically leads to suboptimal results, costly trial-and-error experiments are performed to find a suitable configuration for a given problem, which leads to better results [4]. To obtain better results while avoiding trial-and-error experiments, fitness landscape analysis can be used [16, 23]. The goal is to analytically understand the search problem, determine difficulties of the problem, and identify suitable configurations of heuristics that can cope with these difficulties (cf.  [16, 19]).

In this paper, we investigate the search problem of test suite generation for mobile applications (apps). We rely on Sapienz, which uses a default NSGA-II to generate test suites for apps [17]. NSGA-II has been selected as it “is a widely-used multiobjective evolutionary search algorithm, popular in SBSE research” [17, p. 97], but without adapting it to the specific problem (instance). Thus, our goal is to analyze the fitness landscape of Sapienz and use the insights for adapting the heuristic of Sapienz. This should eventually yield better test results.

Our analysis focuses on the global topology of the landscape, especially how solutions (test suites) are spread in the search space and evolve over time. Thus, we are interested in the genotypic diversity of solutions, which is considered important for evolutionary search [30]. According to our analysis, Sapienz lacks diversity of solutions, so we extend it to Sapienz \(^{div}\), which integrates four diversity-promoting mechanisms. Therefore, our contributions are the descriptive study analyzing the fitness landscape of Sapienz (Sect. 3), Sapienz \(^{div}\) (Sect. 4), and the empirical study with 76 apps evaluating Sapienz \(^{div}\) (Sect. 5).

2 Background: Sapienz and Fitness Landscape Analysis

Sapienz is a multi-objective search-based testing approach [17]. Using NSGA-II, it automatically generates test suites for end-to-end testing of Android apps. A test suite t consists of m test cases \(\left\langle s_1, s_2,...,s_m \right\rangle \), each of which is a sequence of up to n GUI-level events \(\left\langle e_1, e_2,...,e_n \right\rangle \) that exercise the app under test. The generation is guided by three objectives: (i) maximize fault revelation, (ii) maximize coverage, and (iii) minimize test sequence length. Having no oracle, Sapienz considers a crash of the app caused by a test as a fault. Coverage is measured at the code (statement coverage) or activity level (skin coverage). Given these objectives, the fitness function is the triple of the number of crashes found, coverage, and sequence length. To evaluate the fitness of a test suite, Sapienz executes the suite on the app under test deployed on an Android device or emulator.

A fitness landscape analysis can be used to better understand a search problem [16]. A fitness landscape is defined by three elements (cf.  [28]): (1) A search space as a set X of potential solutions. (2) A fitness function \(f_k : X \rightarrow \mathrm{I\!R}\) for each of the k objectives. (3) A neighborhood relation \(N : X \rightarrow 2^{X}\) that associates neighbor solutions to each solution (e.g., using basic operators, or distances of solutions). Based on these three elements, various metrics have been proposed to analyze the landscape [16, 23]. They characterize the landscape, for instance, in terms of the global topology (i.e., how solutions and the fitness are distributed), local structure (i.e., ruggedness and smoothness), and evolvability (i.e., the ability to produce fitter solutions). The goal of analyzing the landscape is to determine difficulties of a search problem and identify suitable configurations of search algorithms that can cope with these difficulties (cf.  [16, 19]).

3 Fitness Landscape Analysis of Sapienz

3.1 Fitness Landscape of Sapienz

At first, we define the three elements of a fitness landscape (cf. Sect. 2) for Sapienz: (1) The search space is given by all possible test suites t according to the representation of test suites in Sect. 2. (2) The fitness function is given by the triple of the number of crashes found, coverage, and test sequence length (cf. Sect. 2). (3) As the neighborhood relation we define a genotypic distance metric for two test suites (see Algorithm 1). The distance of two test suites \(t_1\) and \(t_2\) is the sum of the distances between their ordered test sequences, which is obtained by comparing all sequences \(s^{t_1}_i\) of \(t_1\) and \(s^{t_2}_i\) of \(t_2\) by index i (lines 2–4). The distance of two sequences is the difference of their lengths (line 5) increased by 1 for each different event at index j (lines 6–9). Thus, the distance is based on the differences of ordered events between the ordered sequences of two test suites.

Algorithm 1. Distance metric for two test suites.
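The following Python sketch reconstructs this distance computation from the description above; it assumes a test suite is a list of test sequences and a test sequence is a list of comparable events, and details may differ from the original Algorithm 1.

def sequence_distance(s1, s2):
    """Distance between two test sequences: the difference of their lengths,
    increased by 1 for each index at which the events differ (lines 5-9)."""
    dist = abs(len(s1) - len(s2))
    for e1, e2 in zip(s1, s2):  # compare events at the same index j
        if e1 != e2:
            dist += 1
    return dist

def suite_distance(t1, t2):
    """Genotypic distance between two test suites: the sum of the distances
    of their test sequences compared by index i (lines 2-4)."""
    return sum(sequence_distance(s1, s2) for s1, s2 in zip(t1, t2))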

This metric is motivated by the basic mutation operator of Sapienz, which shuffles the order of test sequences within a suite and the order of events within a sequence. It is common that the neighborhood relation is based on operators that make small changes to solutions [19].

3.2 Experimental Setup

To analyze the fitness landscape of Sapienz, we extended Sapienz with metrics that characterize the landscape. We then executed Sapienz on five apps, repeated each execution five times, and report the mean values of the metrics for each app.

The five apps we selected for the descriptive study are part of the 68 F-Droid benchmark apps [6] used to evaluate Sapienz [17]. We selected aarddict, MunchLife, and passwordmanager since Sapienz did not find any fault for these apps, and hotdeath and k9mail, for which Sapienz did find faults [17]. Thus, we consider apps for which Sapienz did and did not reveal crashes to obtain potentially different landscape characteristics that may present difficulties to Sapienz.

We configured Sapienz as in [17]: the crossover and mutation rates are set to 0.7 and 0.3, respectively, the population and offspring size is 50, and an individual (test suite) contains 5 test sequences, each constrained to 20–500 events. Instead of 100 generations [17], we set the number of generations to 40 (stopping criterion) since we observed in initial experiments that the search stagnates earlier.

3.3 Results

The results of our study provide an analysis of the fitness landscape of Sapienz with respect to the global topology, particularly the diversity of solutions, how the solutions are spread in the search space, and how they evolve over time. According to Smith et al. [27, p. 31], “No single measure or description can possibly characterize any high-dimensional heterogeneous search space”. Thus, we selected \(11\) metrics from the literature and implemented them in Sapienz; they characterize (1) the Pareto-optimal solutions, (2) the population, and (3) the connectedness of Pareto-optimal solutions, all with a focus on diversity. These metrics are computed after every generation so that we can analyze their development over time. In the following, we discuss these \(11\) metrics and the results of the fitness landscape analysis. The results are shown in Fig. 1, where the metrics (y-axis) are plotted over the 40 generations of the search (x-axis) for each of the five apps.

Fig. 1. Fitness landscape analysis results for Sapienz.

(1) Metrics for Pareto-Optimal Solutions 

\(\bullet \) Proportion of Pareto-optimal solutions (ppos). For a population P, ppos is the number of Pareto-optimal solutions \(P_{opt}\) divided by the population size: \(ppos(P) = \frac{|P_{opt}|}{|P|}\). A high and especially strongly increasing ppos may indicate that the search based on Pareto dominance stagnates due to missing selection pressure [24]. A moderately increasing ppos may indicate a successful search.
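As an illustration (not part of Sapienz), ppos can be computed from the fitness triples of a population roughly as follows; the objective order (crashes, coverage, length) is assumed from Sect. 2.

def dominates(a, b):
    """True if fitness triple a Pareto-dominates b, assuming (crashes, coverage,
    length) with crashes and coverage maximized and sequence length minimized."""
    no_worse = a[0] >= b[0] and a[1] >= b[1] and a[2] <= b[2]
    better = a[0] > b[0] or a[1] > b[1] or a[2] < b[2]
    return no_worse and better

def ppos(fitnesses):
    """Proportion of non-dominated (Pareto-optimal) members in the population."""
    optimal = [f for i, f in enumerate(fitnesses)
               if not any(dominates(g, f) for j, g in enumerate(fitnesses) if j != i)]
    return len(optimal) / len(fitnesses)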

For Sapienz and all apps (see Fig. 1(a)), ppos slightly fluctuates since a new solution can potentially dominate multiple previously non-dominated solutions. At the beginning of the search, ppos is low (0.0–0.1), shows no improvement in the first 15–20 generations, and then increases for all apps except for passwordmanager. Thus, the search seems to progress, while the sharply increasing ppos for MunchLife and hotdeath might indicate a stagnation of the search.

\(\bullet \) Hypervolume (hv). To further investigate the search progress, we compute the hv after each generation. The hv is the volume in the objective space covered by the Pareto-optimal solutions [10, 31]. Thus, an increasing hv indicates that the search is able to find improved solutions, otherwise the hv and search stagnate.

Based on the objectives of Sapienz (max. crashes, max. coverage, and min. sequence length), we choose the nadir point (0 crashes, 0 coverage, and sequence length of 500) as the reference point for the hv. In Fig. 1(b), the evolution of the hv over time rather than the absolute numbers is relevant to analyze the search progress of Sapienz. While the hv increases during the first 25 generations, it stagnates afterwards for all apps; for k9mail already after 5 generations. For aarddict, MunchLife, and hotdeath the hv stagnates after the ppos drastically increases (cf. Fig. 1(a)), further indicating a stagnation of the search.

(2) Population-Based Metrics 

\(\bullet \) Population diameter (diam). The diam metrics measure the spread of all population members in the search space using a distance metric for individuals, in our case Algorithm 1. The maximum diam computes the largest distance between any two individuals of the population P: \({\textit{maxdiam}} (P) = \max _{x_i, x_j \in P} dist(x_i, x_j)\) [5, 20], showing the absolute spread of P. To account for outliers, we also compute the average diam as the average of all pairwise distances between all individuals [5]:

$$\begin{aligned} {\textit{avgdiam}} (P) = \frac{\sum _{i=1}^{|P|}\sum _{j=1, j \ne i}^{|P|}{dist(x_i, x_j)}}{|P|(|P|-1)} \end{aligned}$$
(1)

Additionally, we compute the minimum diameter to see how close individuals are in the search space, or whether some are even identical: \({\textit{mindiam}} (P) = \min _{x_i, x_j \in P} dist(x_i, x_j)\).
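The three diameters can be computed in one pass over the pairwise distances; a minimal sketch, reusing suite_distance from Sect. 3.1:

def diameters(P, dist=suite_distance):
    """Maximum, average, and minimum population diameter over all pairwise
    genotypic distances of the population members (cf. Eq. 1)."""
    pairs = [dist(P[i], P[j])
             for i in range(len(P)) for j in range(len(P)) if i != j]
    maxdiam = max(pairs)
    avgdiam = sum(pairs) / (len(P) * (len(P) - 1))
    mindiam = min(pairs)
    return maxdiam, avgdiam, mindiam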

Concerning each plot for Sapienz and all apps (see Fig. 1(c)), the upper, middle, and lower curves are maxdiam, avgdiam, and mindiam, respectively. For each curve, we see a clear trend that the metrics decrease over time, which is typical for genetic algorithms due to the crossover. However, the metrics drastically decrease for Sapienz in the first 25 generations. The avgdiam decreases from \(1500\) to eventually \(200\) for each app. The maxdiam decreases similarly but stays higher for hotdeath and k9mail than for the other apps. The development of the avgdiam and maxdiam indicates that all individuals are continuously getting closer to each other in the search space, thus becoming more similar. The population even contains identical solutions, as indicated by mindiam reaching 0.

\(\bullet \) Relative population diameter (reldiam). Bachelet [5] further proposes the relative population diameter, which is the avgdiam in proportion to the largest possible distance d: \({\textit{reldiam}} (P) = \frac{avgdiam(P)}{d}\). This metric is indicative of the concentration of the population in the search space. A small reldiam indicates that the population members are grouped together in a region of the space [5].

For Sapienz, the largest possible distance d between two test suites is 2500, which is reached when they differ in all events (up to 500 per test sequence) of all five test sequences. For \(d = 2500\) and all apps (cf. Fig. 1(d)), reldiam starts at a high level of around 0.9, indicating that the solutions are spread in the search space. Then, it decreases in the first 25 generations to around 0.4 (aarddict, MunchLife, and passwordmanager), and below 0.3 (hotdeath and k9mail), indicating a grouping of the solutions in one or more regions of the search space.

(3) Metrics Based on the Connectedness of Pareto-Optimal Solutions 

The following metrics analyze the connectedness and thus, clusters of Pareto-optimal solutions in the search space [9, 22]. For this purpose, we consider a graph in which Pareto-optimal solutions are vertices. The edges connecting the vertices are labeled with weights \(\delta \), which are the number of moves a neighborhood operator has to make to reach one vertex from another [22]. This results in a graph of fully connected Pareto-optimal solutions. Introducing a limit k on \(\delta \) and removing the edges whose weights \(\delta \) are larger than k leads to varying sizes of connected components (clusters) in the graph. This graph can be analyzed by metrics to characterize the Pareto-optimal solutions in the search space [12, 22].

In our case, the weights \(\delta \) are determined by the distance metric for test suites based on the mutation operator of Sapienz (cf. Algorithm 1). We determined k experimentally to be 300 by investigating values of 400, 300, 200, and 100. While a high value results in a single cluster of Pareto-optimal solutions, a low value results in a high number of singletons (i.e., clusters with one solution). Thus, two test suites (vertices) are connected (neighbors) in the graph if they differ in less than 300 events across their test sequences as computed by Algorithm 1.
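A possible way to obtain these clusters for a given threshold k is a plain connected-components search over the distance graph, reusing the distance sketch from Sect. 3.1; the metrics discussed next can then be read directly off the returned components.

def clusters(pareto, k=300, dist=suite_distance):
    """Connected components of the graph whose vertices are the Pareto-optimal
    solutions and whose edges connect solutions with a distance below k."""
    n = len(pareto)
    visited, components = set(), []
    for start in range(n):
        if start in visited:
            continue
        component, stack = [], [start]
        visited.add(start)
        while stack:
            i = stack.pop()
            component.append(i)
            for j in range(n):
                if j not in visited and dist(pareto[i], pareto[j]) < k:
                    visited.add(j)
                    stack.append(j)
        components.append(component)
    return components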

\(\bullet \) Proportion of Pareto-optimal solutions in clusters (pconnec). This metric divides the number of vertices (Pareto-optimal solutions) that are members of clusters (excl. singletons) by the total number of vertices in the graph [22]. A high pconnec indicates a grouping of the Pareto-optimal solutions in the search space.

As shown in Fig. 1(e), pconnec is relatively low during the first generations before it increases for all apps. For MunchLife, passwordmanager, and hotdeath, pconnec reaches 1, meaning that all Pareto-optimal solutions are in clusters, while it converges to around 0.7 and 0.8 for aarddict and k9mail, respectively. This indicates that the Pareto-optimal solutions are grouped in the search space.

\(\bullet \) Number of clusters (nconnec). We further analyze in how many areas of the search space (clusters) the Pareto-optimal solutions are grouped. Thus, nconnec counts the number of clusters in the graph [12, 22]. A high (low) nconnec indicates that the Pareto-optimal solutions are spread in many (few) areas of the search space.

Figure 1(f) plots nconnec for Sapienz and all apps. The y-axis of each plot denoting nconnec ranges from 0 to 6. Initially, the Pareto-optimal solutions are distributed in 2–4 clusters, then grouped in 1 cluster. An exception is k9mail, for which there always exist more than 3 clusters. Except for k9mail, this indicates that the Pareto-optimal solutions are grouped in one area of the search space.

\(\bullet \) Minimum distance k for a connected graph (kconnec). This metric identifies k so that all Pareto-optimal solutions are members of one cluster [12, 22]. Thus, kconnec quantifies the spread of all Pareto-optimal solutions in the search space.
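One way to obtain kconnec without scanning candidate thresholds is a Kruskal-style merge over the pairwise distances; this is an assumed implementation sketch, not necessarily the procedure of [12, 22].

def kconnec(pareto, dist=suite_distance):
    """Smallest threshold k at which all Pareto-optimal solutions form one
    cluster: merge solutions in order of increasing pairwise distance and
    return the distance that joins the last two components."""
    n = len(pareto)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    edges = sorted((dist(pareto[i], pareto[j]), i, j)
                   for i in range(n) for j in range(i + 1, n))
    components = n
    for w, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            components -= 1
            if components == 1:
                return w
    return 0  # fewer than two solutions: trivially connected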

For Sapienz, Fig. 1(g) plots kconnec (ranging from 0 to 1400) over the generations. Similarly to the diam metrics (cf.  Fig. 1(c)), kconnec decreases, moderately for hotdeath  (from initially \(\approx \)700 to \(\approx \)600) and k9mail (\(\approx \)1000 \(\rightarrow \) \(\approx \)800), and drastically for passwordmanager (\(\approx \)1200 \(\rightarrow \) \(\approx \)200), MunchLife (\(\approx \)1000 \(\rightarrow \) \(\approx \)200), and aarddict (\(\approx \)600 \(\rightarrow \) \(\approx \)100). This indicates that all Pareto-optimal solutions are getting closer in the search space as the spread of the cluster is decreasing.

\(\bullet \) Number of Pareto-optimal solutions in the largest cluster (lconnec). It determines the size of the largest cluster by the number of members [12], showing how many Pareto-optimal solutions are in the most dense area of the search space.

Figure 1(h) plots lconnec (ranging from 0 to 50 given the population size of 50) over the generations. lconnec increases after 15–30 generations to 20 (aarddict and hotdeath) or even 50 (MunchLife) solutions. This indicates that the largest cluster is indeed large, so that many Pareto-optimal solutions are grouped in one area of the search space. In contrast, lconnec always stays below 10 for passwordmanager and k9mail, indicating that their largest clusters are smaller than for the other apps.

\(\bullet \) Proportion of hypervolume covered by the largest cluster (hvconnec). Besides lconnec, we compute the relative size of the largest cluster in terms of hypervolume (hv). Thus, hvconnec is the proportion of the overall hv covered by the Pareto-optimal solutions in the largest cluster. It quantifies how this cluster in the search space dominates in the objective space and contributes to the hv.

For Sapienz (cf. Fig. 1(i)), hvconnec varies a lot during the first 10 generations, then stabilizes at a high level for all apps. For aarddict, MunchLife, and passwordmanager, the largest cluster covers \(100\%\) of the hv since there is only 1 cluster left (cf. nconnec in Fig. 1(f)). For hotdeath, hvconnec is close to \(70\%\), indicating that there is 1 other cluster covering \(30\%\) of the hv (cf. nconnec). For k9mail, hvconnec is around \(90\%\), indicating that the other 2–3 clusters (cf. nconnec) cover only \(10\%\) of the hv. This indicates that the largest cluster covers the largest proportion of the hv, and thus contributes most to the Pareto front.

3.4 Discussion

The results characterizing the fitness landscape of Sapienz reveal insights about how Sapienz manages the search problem of generating test suites for apps.

Firstly, the development of the proportion of Pareto-optimal solutions (cf. Fig. 1(a)) and hypervolume (cf.  Fig. 1(b)) indicates a stagnation of the search after 25 generations. The drastically increasing proportion of Pareto-optimal solutions in some cases may indicate a problem of dominance resistance, i.e., the search cannot produce new solutions that dominate the current, poorly performing but locally non-dominated solutions [24]. In other cases, the proportion remains low, i.e., the search cannot find many non-dominated solutions.

Secondly, the development of the population diameters (cf. Fig. 1(c)) indicates a decreasing diversity of all solutions during the search. The development of the relative population diameter (cf. Fig. 1(d)) corroborates this observation and indicates that the population members are concentrated in the search space [5]. The minimum diameter (cf. Fig. 1(c)) even indicates that the population contains duplicates of solutions, which reduces the genetic variation in the population.

Thirdly, the development of the proportion of Pareto-optimal solutions in clusters (cf.  Fig. 1(e)) indicates a grouping of these solutions in the search space, mostly in one cluster (cf.  Fig. 1(f)). Another indicator for the decreasing diversity of the Pareto-optimal solutions is the decreasing minimum distance k required to form one cluster of all these solutions (cf.  Fig. 1(g)). Additionally, the largest cluster is often indeed large in terms of number of Pareto-optimal solutions (cf.  Fig. 1(h)), and hypervolume covered by these solutions (cf.  Fig. 1(i)). Even if there exist multiple clusters of Pareto-optimal solutions, the largest cluster still contributes most to the overall hypervolume and thus, to the Pareto front.

In summary, the fitness landscape analysis of Sapienz indicates a stagnation of the search while the diversity of all solutions decreases in the search space.

4 Sapienz \(^{div}\)

Given the fitness landscape analysis results, Sapienz suffers from a decreasing diversity of solutions in the search space over time. It is known that the performance of genetic algorithms is influenced by diversity [21, 30]. A low diversity may lead the search to a local optimum that cannot be escaped easily [30]. Thus, diversity is important to address dominance resistance so that the search can produce new solutions that dominate poorly performing, locally non-dominated solutions [24]. Moreover, Shir et al.  [26, p. 95] report that promoting diversity in the search space does not hamper “the convergence to a precise and diverse Pareto front approximation in the objective space of the original algorithm”.

Therefore, we extended Sapienz to Sapienz \(^{div}\) by integrating mechanisms into the search algorithm that promote the diversity of the population in the search space. We developed four mechanisms that extend the Sapienz algorithm at different steps: at the initialization, before and after the variation, and at the selection. Algorithm 2 shows the extended search algorithm of Sapienz \(^{div}\) and highlights the novel mechanisms in blue. We now discuss these mechanisms.

Diverse initial population. As the initial population may affect the results of the search [13], we assume that a diverse initial population could be a better start for the exploration. Thus, we extend the generation of the initial population \(P_{init}\) to promote diversity. Instead of generating \(|P| = size_{pop}\) solutions, we generate \(size_{init}\) solutions where \(size_{init}>size_{pop}\) (line 7 in Algorithm 2). Then, we select those \(size_{pop}\) solutions from \(P_{init}\) that are most distant from each other using Algorithm 1, to form the first population P (line 8).

Algorithm 2. The extended search algorithm of Sapienz \(^{div}\).
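A greedy farthest-point selection is one way to realize the “most distant” selection used here (and again in lines 17 and 31 of Algorithm 2); the following sketch assumes this strategy, and generate_test_suites in the usage comment is a hypothetical stand-in for the original Sapienz initialization.

def select_most_distant(candidates, size_pop, dist=suite_distance):
    """Greedily pick size_pop solutions so that each newly added solution has
    the largest minimum distance to the solutions selected so far."""
    remaining = list(range(len(candidates)))
    selected = [remaining.pop(0)]  # start from an arbitrary solution
    while len(selected) < size_pop and remaining:
        best = max(remaining,
                   key=lambda i: min(dist(candidates[i], candidates[j])
                                     for j in selected))
        remaining.remove(best)
        selected.append(best)
    return [candidates[i] for i in selected]

# Diverse initial population (Algorithm 2, lines 7-8):
# P_init = generate_test_suites(size_init)   # hypothetical Sapienz initializer
# P = select_most_distant(P_init, size_pop)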

Adaptive diversity control. This mechanism dynamically controls the diversity if the population members are becoming too close in the search space relative to the initial population. It further makes the algorithm adaptive as it uses feedback of the search to adapt the search (cf.  [30]).

To quantify the diversity \(div_{pop}\) of population P, we use the average population diameter (avgdiam) defined in Eq. 1. At the beginning of each generation, \(div_{pop}\) is calculated (line 13) and compared to the diversity of the initial population \(div_{init}\) (line 14) calculated once in line 10. The comparison checks whether \(div_{pop}\) has decreased to less than \(div_{limit} \times div_{init}\). For example, the condition is satisfied for the given threshold \(div_{limit}=0.4\) if \(div_{pop}\) has decreased to less than \(40\%\) of \(div_{init}\).

In this case, the offspring Q is obtained by generating new solutions using the original Sapienz method to initialize a population (line 15). The next population is then formed by selecting the |P| most distant individuals from the current population P and offspring Q (line 17). Otherwise, the variation operators (crossover and mutation) of Sapienz are applied to obtain the offspring (line 19), followed by the selection. Thus, this mechanism promotes diversity by inserting new individuals into the population, which has an effect similar to restarting the search.
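A sketch of this control step, reusing the diameter and selection sketches above; generate_test_suites and crossover_and_mutate are hypothetical placeholders for the corresponding Sapienz steps.

def diversity_controlled_offspring(P, div_init, div_limit=0.5, dist=suite_distance):
    """If the average population diameter drops below div_limit * div_init,
    inject freshly generated test suites and keep the |P| most distant ones;
    otherwise apply the regular variation (the offspring then goes through
    the usual selection)."""
    _, div_pop, _ = diameters(P, dist)  # avgdiam of the current population
    if div_pop < div_limit * div_init:
        Q = generate_test_suites(len(P))  # restart-like injection of new solutions
        return select_most_distant(P + Q, len(P), dist)
    return crossover_and_mutate(P)        # regular Sapienz crossover and mutation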

Duplicate elimination. The fitness landscape analysis found duplicated test suites in the population. Eliminating duplicates is one technique to maintain diversity and improve search performance [25, 30]. Thus, we remove duplicates after reproduction and before selection in the current population and offspring (line 21). Duplicated test suites are identified by a distance of 0 computed by Algorithm 1.
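A straightforward sketch of the elimination step, again based on the distance sketch from Sect. 3.1:

def eliminate_duplicates(solutions, dist=suite_distance):
    """Remove duplicated test suites, identified by a genotypic distance of 0."""
    unique = []
    for t in solutions:
        if all(dist(t, u) > 0 for u in unique):
            unique.append(t)
    return unique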

Hybrid selection. To promote diversity in the search space, the selection is extended by dividing it in two parts: (1) The non-dominated sorting of NSGA-II is performed as in Sapienz (lines 22–29 in Algorithm 2) to obtain the solutions \(P'\) sorted by domination rank and crowding distance. (2) From \(P'\), the best \((size_{pop} - n_{div})\) solutions form the next population P where \(size_{pop}\) is the size of P and \(n_{div}\) the configurable number of diverse solutions to be included in P (line 30). These \(n_{div}\) diverse solutions \(P_{div}\) are selected as the most distant solutions from the current population and offspring PQ (line 31) using the distance metric of Algorithm 1. Finally, \(P_{div}\) is added to the next population P (line 32).

While the NSGA-II sorting considers the diversity of solutions in the objective space (crowding distance), the selection of Sapienz \(^{div}\) also considers the diversity of solutions in the search space, which makes the selection hybrid.
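A sketch of the hybrid selection, where ranked stands for the solutions \(P'\) already sorted by domination rank and crowding distance; handling of overlaps between the two parts is omitted.

def hybrid_selection(ranked, PQ, size_pop, n_div, dist=suite_distance):
    """Fill (size_pop - n_div) slots with the best NSGA-II-ranked solutions and
    the remaining n_div slots with the most distant solutions from P + Q."""
    next_population = ranked[:size_pop - n_div]
    next_population += select_most_distant(PQ, n_div, dist)
    return next_population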

5 Evaluation

We evaluate Sapienz \(^{div}\) in a head-to-head comparison with Sapienz to investigate the benefits of the diversity-promoting mechanisms. Our evaluation targets five research questions (RQ) with two empirical studies similarly to [17]:

  • RQ1. How does the coverage achieved by Sapienz \(^{div}\) compare to Sapienz?

  • RQ2. How do the faults found by Sapienz \(^{div}\) compare to Sapienz?

  • RQ3. How does Sapienz \(^{div}\) compare to Sapienz concerning the length of their fault-revealing test sequences?

  • RQ4. How does the runtime overhead of Sapienz \(^{div}\) compare to Sapienz?

  • RQ5. How does the performance of Sapienz \(^{div}\) compare to the performance of Sapienz with inferential statistical testing?

5.1 Experimental Setup

We conduct two empirical studies, Study 1 to answer RQ1–4, and Study 2 to answer RQ5. The execution of both studies was distributed over eight servers, where each server ran one approach to test one app at a time using 10 Android emulators (Android KitKat, API 19). We configured Sapienz and Sapienz \(^{div}\) as in the experiment for the fitness landscape analysis (cf. Sect. 3.2) and in [17]. The only difference is that we test each app for 10 generations, in contrast to Mao et al. [17] who test each app for one hour, since we were not in full control of the servers running in the cloud. However, we still report the execution times of both approaches (RQ4). Moreover, we configured the novel parameters of Sapienz \(^{div}\) as follows: \(size_{init}=100\), \(div_{limit}=0.5\), and \(n_{div}=15\). For Study 1 we perform one run to test each app over 10 generations with each approach. For Study 2 we perform 20 repetitions of such runs for each app and approach.

5.2 Results

Study 1. In this study we use 66 of the 68 F-Droid benchmark apps provided by Choudhary et al. [6] and used to evaluate Sapienz [17]. The results on each app are shown in Table 1 where S refers to Sapienz, Sd to Sapienz \(^{div}\), Coverage to the final statement coverage achieved, #Crashes to the number of revealed unique crashes, Length to the average length of the minimal fault-revealing test sequences (or ‘–’ if no fault has been found), and Time (min) to the execution time in minutes of each approach to test the app over 10 generations.

RQ1. Sapienz achieves a higher final coverage for 15 apps, Sapienz \(^{div}\) for 24 apps, and both achieve the same coverage for 27 apps. Figure 2 shows that a similar coverage is achieved by both approaches on the 66 apps, on average 45.05 by Sapienz and 45.67 by Sapienz \(^{div}\), providing initial evidence that Sapienz \(^{div}\) and Sapienz perform similarly with respect to coverage.

RQ2. To report about the found faults, we count the total crashes, out of which we also identify the unique crashes (i.e., their stack traces are different from the traces of the other crashes of the app). Moreover, we exclude faults caused by the Android system (e.g., native crashes) and test harness (e.g., code instrumentation).

As shown in Table 2, Sapienz \(^{div}\) revealed more total (6941 vs 5974) and unique (141 vs 119) crashes, and found faults in more apps (46 vs 43) than Sapienz. Moreover, it found 51 unique crashes undetected by Sapienz, Sapienz found 29 unique crashes undetected by Sapienz \(^{div}\), and both found the same 90 unique crashes. The results for the 66 apps provide initial evidence that Sapienz \(^{div}\) can outperform Sapienz in revealing crashes.

Table 1. Results on the 66 benchmark apps.

RQ3. Considering the minimal fault-revealing test sequences (i.e., the shortest of all sequences causing the same crash), their mean length is 244 for Sapienz \(^{div}\) and 209 for Sapienz on the 66 apps (cf.  Table 2). This provides initial evidence that Sapienz \(^{div}\) produces longer fault-revealing sequences than Sapienz.

RQ4. Considering the mean execution time of testing one app over 10 generations, Sapienz \(^{div}\) takes 118 min and Sapienz 101 min for the 66 apps. Figure 3 shows that the diversity-promoting mechanisms of Sapienz \(^{div}\) cause a noticeable runtime overhead compared to Sapienz. This provides initial evidence about the cost of promoting diversity at which an improved fault detection can be obtained.

Study 2. In this study we use the same 10 F-Droid apps as in the statistical analysis in [17]. As we cannot assume a Gaussian distribution of the results, we use the Kruskal-Wallis test to assess statistical significance (\(p < 0.05\)) and the Vargha-Delaney effect size \(\hat{A}_{12}\) to characterize small, medium, and large differences between Sapienz \(^{div}\) and Sapienz (\(\hat{A}_{12} >\) 0.56, 0.64, and 0.71, respectively).
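For illustration, such a per-app comparison could be computed as in the following sketch, using SciPy’s Kruskal-Wallis test and a direct implementation of \(\hat{A}_{12}\); the run lists in the usage comment are hypothetical.

from scipy.stats import kruskal

def a12(xs, ys):
    """Vargha-Delaney effect size: probability that a value drawn from xs is
    larger than one drawn from ys (ties count as 0.5)."""
    greater = sum(1 for x in xs for y in ys if x > y)
    ties = sum(1 for x in xs for y in ys if x == y)
    return (greater + 0.5 * ties) / (len(xs) * len(ys))

# Hypothetical per-app comparison of, e.g., coverage over the 20 repetitions:
# stat, p = kruskal(sapienz_div_runs, sapienz_runs)
# effect = a12(sapienz_div_runs, sapienz_runs)  # > 0.71 indicates a large difference
# significant = p < 0.05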

Fig. 2. Coverage.

Table 2. Crashes and seq. length.

Fig. 3. Time (min).

Fig. 4. Performance comparison on 10 apps for Sapienz \(^{div}\) (Sd) and Sapienz (S).

RQ5. The results are presented by boxplots in Fig. 4 for each of the 10 apps and concern: coverage, #crashes, sequence length, and time (cf. Study 1). The \(\hat{A}_{12}\) effect sizes for these concerns are shown in Table 3, which compares Sapienz \(^{div}\) and Sapienz (Sd-S) and emphasizes statistically significant results in bold. Sapienz significantly outperforms Sapienz \(^{div}\) with large effect size on all apps for execution time. The remaining results are inconclusive: Sapienz \(^{div}\) significantly outperforms Sapienz with large effect size on only 3/10 apps for coverage, 2/10 for #crashes, and barely 1/10 for length. The other results are not statistically significant or do not indicate large differences.

5.3 Discussion

Study 1 provided initial evidence that Sapienz \(^{div}\) can find more faults than Sapienz while achieving a similar coverage but using longer sequences. Especially the fault revelation capabilities of Sapienz \(^{div}\) seemed promising; however, we could not confirm them by the statistical analysis in Study 2. The results of Study 2 are inconclusive in differentiating both approaches by their performance. Potentially, the diversity promotion of Sapienz \(^{div}\) does not result in the desired effect within the first 10 generations we considered in the studies. In contrast, it might show a stronger effect at later stages, since we observed in the fitness landscape analysis that the search of Sapienz stagnates after 25 generations.

Table 3. Vargha-Delaney effect size (statistically significant results in bold).

6 Threats to Validity

Internal validity. A threat to the internal validity is a bias in the selection of the apps we took from [6, 17] although the 10 apps for Study 2 were selected by an “unbiased random sampling” [17, p. 103]. We further use the default configuration of Sapienz and Sapienz \(^{div}\) without tuning the parameters to reduce the threat of overfitting to the given apps. Finally, the correctness of the diversity-promoting mechanisms is a threat that we addressed by computing the fitness landscape analysis metrics with Sapienz \(^{div}\) to confirm the improved diversity.

External validity. As we used 5 (for analyzing the fitness landscape) and 76 Android apps (for evaluating Sapienz \(^{div}\)) out of over 2,500 apps on F-Droid and millions on Google Play, we cannot generalize our findings although we rely on the well-accepted “68 F-Droid benchmark apps” [6].

7 Related Work

Related work exists in two main areas: approaches on test case generation for apps, and approaches on diversity in search-based software testing (SBST).

Test case generation for apps. Such approaches use random, model-based, or systematic exploration strategies for the generation. Random strategies implement UI-guided test input generators where events on the GUI are selected randomly [3]. Dynodroid [14] extends the random selection using weights and frequencies of events. Model-based strategies such as PUMA [8], DroidBot [11], MobiGUITAR [2], and Stoat [29] apply model-based testing to apps. Systematic exploration strategies range from full-scale symbolic execution [18] to evolutionary algorithms [15, 17]. None of these approaches explicitly manages diversity, except for Stoat [29], which encodes the diversity of sequences into the objective function.

Diversity in SBST. Diversity of solutions has been researched for test case selection and generation. For the former, promoting diversity can significantly improve the performance of state-of-the-art multi-objective genetic algorithms [21]. For the latter, promoting diversity results in increased lengths of tests without improved coverage [1], matching our observation. Both lines of work show that diversity promotion is crucial and that its realization “requires some care” [24, p. 782].

8 Conclusions and Future Work

In this paper, we reported on our descriptive study analyzing the fitness landscape of Sapienz indicating a lack of diversity during the search. Therefore, we proposed Sapienz \(^{div}\) that integrates four mechanisms to promote diversity. The results of the first empirical study on the 68 F-Droid benchmark apps were promising for Sapienz \(^{div}\) but they could not be confirmed statistically by the inconclusive results of the second study with 10 further apps. As future work, we plan to extend the evaluation to more generations to see the effect of Sapienz \(^{div}\) when the search of Sapienz stagnates. Moreover, we plan to identify diversity-promoting mechanisms that quickly yield benefits in the first few generations.