1 Introduction

Mutation testing is a method for evaluating the quality of a test suite and/or for creating a set of test cases [1]. The main idea originates from fault injection techniques. A change is introduced into a program under test. The change typically represents a possible mistake made by a programmer. It is assumed that the change cannot be detected by simply compiling the program. A changed program is called a mutant.

The main obstacle to applying mutation testing is its high cost. Therefore, there are many approaches to cost reduction [1, 2]. Some of them are based on lowering the number of generated mutants and/or the number of executed test cases. Mutant clustering is one of the cost reduction methods that has been considered for the mutation testing process [3, 4].

The first results on mutant clustering for object-oriented programs were reported in [5]. A general experimental scenario was proposed for evaluation of the tradeoff between mutation score accuracy and the complexity of a mutation testing process expressed in a number of generated mutants and a number of test cases. The scenario was adapted to three cost reduction techniques: selection of mutants, mutant sampling, and clustering. The detailed results of mutant clustering experiments for C# programs, the experimental scenario, and evaluation of a quality metric are given in [6].

This paper addresses another problem of mutant clustering: how the results of mutant clustering experiments can be generalized so that they are useful for other projects. In particular, it is of interest how to evaluate relations between mutants generated by different mutation operators. Therefore, three new metrics were developed, dealing with the usefulness of mutants generated by a given operator, their frequency, and the dependency among mutants. The approach was applied using these metrics and the data gathered in our previous experiments [5, 6]. We could examine in a quantitative way the differences between standard and object-oriented operators, identify a pair of complementary operators, and classify operators that cannot be omitted if the accuracy of the mutation result is to be preserved.

The paper is organized as follows: Sect. 2 briefly describes the basic notions of mutation testing and the mutant clustering method. In Sect. 3, metrics used for the analysis of mutant clustering results are introduced and illustrated by an example. Section 4 presents an experiment overview and the results of the conducted experiments. Finally, Sect. 5 concludes the work.

2 Background

In this section, the basic concepts of mutation testing, the idea of mutant clustering, and related work are discussed.

2.1 Mutation Testing

In mutation testing, a program change is specified by a mutation operator and introduced automatically using a mutation tool. Standard, or so-called traditional, mutation operators deal with programming features common to all programming languages, such as arithmetic, logical and relational operators, assignment statements, constant usage, etc. Specialized programming features are covered by dedicated mutation operators. Features characteristic of object-oriented languages (OO in short) are handled by object-oriented mutation operators proposed, for example, for Java [7] or C# [8, 9]. If a mutation operator is applied only once and in one place of a program, we speak about first order mutation.

Evaluation of a test suite is performed in a mutation testing process. For a given program and a set of selected mutation operators, a set of mutants is created. The mutants are run against tests from the test suite under consideration. If the behavior of a mutant differs from the behavior of the original program, the mutant is said to be killed. Tests that are able to kill mutants should be good at revealing mistakes represented by the mutation operators. A mutation testing result, called a mutation score (MS), is calculated as the ratio of the number of all killed mutants to the number of all nonequivalent mutants. A mutant is equivalent if its behavior cannot be distinguished from the original program by any test. In many practical cases, an approximate value is calculated instead of the exact mutation score, because it is not possible to identify all equivalent mutants automatically.
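The mutation score definition can be illustrated with a minimal Python sketch. The data layout (a dictionary mapping each mutant to the set of tests that kill it), the mutant and test names, and the function name `mutation_score` are assumptions made for illustration only; they are not taken from any mutation tool.

```python
from typing import Dict, Set


def mutation_score(kills: Dict[str, Set[str]], equivalent: Set[str] = frozenset()) -> float:
    """Ratio of killed mutants to non-equivalent mutants.

    kills maps a mutant id to the set of test ids that kill it
    (hypothetical layout). Mutants listed in `equivalent` are excluded.
    """
    non_equivalent = [m for m in kills if m not in equivalent]
    if not non_equivalent:
        raise ValueError("no non-equivalent mutants")
    killed = sum(1 for m in non_equivalent if kills[m])
    return killed / len(non_equivalent)


# Hypothetical results: m2 and m4 are not killed by any test.
kills = {"m1": {"t1", "t2"}, "m2": set(), "m3": {"t3"}, "m4": set()}

# Approximate score: no mutant has been classified as equivalent.
print(mutation_score(kills))
# Exact score under the assumption that m4 is known to be equivalent.
print(mutation_score(kills, equivalent={"m4"}))
```

With no equivalent mutants identified, the approximate score is a lower bound on the exact one, because every unclassified equivalent mutant is counted in the denominator.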

2.2 Mutant Clustering

There are many approaches to reducing mutation testing costs that are based on lowering the number of considered mutants and thereby also the number of test runs [1, 2]. One of the solutions analyzed was mutant clustering [3, 4].

The main idea of mutant clustering originates from the concept of equivalence partitioning. The set of all mutants of a program is divided into groups, called clusters. The division is performed in the context of a given test set, as in mutation score evaluation. Each group is characterized by a similar ability of its mutants to be killed by the same subset of tests. Allocation of a mutant to a group can be performed by a clustering algorithm such as agglomerative hierarchical or K-means clustering [3, 10].

Mutant clustering is specified for a given set of mutants of a program under test and a given set of tests. A threshold \(K\) denotes the resemblance between mutant groups. Two groups are said to be similar with degree \(K\) if the number of tests that kill at least one mutant from one group and kill no mutant from the other group equals \(K\).

Next, in the mutation testing process, only one mutant from each group is used instead of all mutants. This mutant represents the group (cluster), whose members should have comparable features as far as the subset of tests associated with this group is concerned. Using a reduced number of mutants lowers the mutation costs. However, the accuracy of the mutation score can decline.
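The grouping and the choice of representatives can be sketched as follows. This is a simplified illustration, not the algorithm of [3] or of the CREAM tool. It assumes that the difference between two groups is counted symmetrically (tests that kill a mutant in exactly one of the two groups), that groups are merged greedily while this difference is at most \(K\), and that the first mutant of a group serves as its representative. The kill data and all names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Set


def killing_tests(group: List[str], kills: Dict[str, Set[str]]) -> Set[str]:
    """Tests that kill at least one mutant of the group."""
    return set().union(*(kills[m] for m in group))


def distance(g1: List[str], g2: List[str], kills: Dict[str, Set[str]]) -> int:
    """Number of tests killing a mutant in exactly one of the two groups
    (symmetric reading of the similarity degree; an assumption)."""
    return len(killing_tests(g1, kills) ^ killing_tests(g2, kills))


def cluster(kills: Dict[str, Set[str]], k: int) -> List[List[str]]:
    """Greedy agglomerative grouping: merge the closest pair of groups
    as long as their distance does not exceed the threshold k."""
    groups = [[m] for m in sorted(kills)]
    while len(groups) > 1:
        i, j = min(
            combinations(range(len(groups)), 2),
            key=lambda p: distance(groups[p[0]], groups[p[1]], kills),
        )
        if distance(groups[i], groups[j], kills) > k:
            break
        groups[i] = groups[i] + groups[j]
        del groups[j]  # j > i, so index i stays valid
    return groups


# Hypothetical kill data for five mutants and four tests.
KILLS = {
    "m1": {"t1", "t2"},
    "m2": {"t1", "t2"},
    "m3": {"t1"},
    "m4": {"t3"},
    "m5": {"t3", "t4"},
}

groups = cluster(KILLS, k=1)
representatives = [g[0] for g in groups]  # one mutant per group
print(groups)
print(representatives)
```

With `k=0` only mutants killed by exactly the same tests are grouped; larger values of `k` give fewer groups and hence fewer representative mutants, at the price of a less accurate mutation score.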

2.3 Related Works

Initial experiments on mutant clustering in mutation testing were conducted for C programs [3]. They reported considerable potential benefits; for example, using 13 % of all mutants and 8 % of tests gave a mutation score of high accuracy (99 %). However, this result referred to a simple, non-object-oriented programming language and only to standard mutation operators, which are usually more redundant.

The above-mentioned result was based on full data, i.e., all mutants run against all tests. The practical solution to mutation clustering based on a static domain analysis was presented in [4]. The proof of concept was illustrated by a small Java program, for which satisfactory results were obtained, namely after running 25 % of mutants with 62 % of tests the mutation score was equal to 94 % of the exact mutation score.

Mutant clustering in the context of object-oriented operators was studied for the first time in an experimental process for comparing different cost reduction techniques [5]. A detailed analysis of mutant clustering was given in [6]. The quantitative data of this experiment are recalled in Sect. 4.1. This paper provides further methods of clustering data analysis in order to generalize the results to other projects.

The research on cost reduction methods applied to C# programs was performed only by the author of [5, 6]. Most of the other work on object-oriented programs was done for Java programs [11–13], but the clustering method was not considered.

3 Metrics for Generalization of Clustering Results

A cluster of mutants can include mutants generated by various mutation operators. Many mutants created with the same mutation operator (Op) can contribute to the same cluster \(\{Op1, Op2,\ldots \}\). Therefore, there can exist clusters that consist mainly of mutants of selected mutation operators. These mutants are the most probable representatives of these clusters.

Mutants of the same operator can also be found in many clusters. Some pairs of operators can be associated and occur in the same clusters. In such a case it could be possible to omit one of the operators.

In order to evaluate such phenomena quantitatively, the following metrics were used.

3.1 Metric Definitions

Three additional metrics were proposed in order to evaluate and compare the clustering results. Each metric is calculated for a mutation operator (Op).

The first metric is the usefulness of mutants (UM). It expresses what fraction of the mutants generated by a given operator is useful, i.e., needed to represent this operator in every group in which it occurs.

$$\begin{aligned} \textit{UM}(Op)=\frac{\textit{NG}(Op)}{\textit{NM}(Op)} \end{aligned}$$
(1)

where

\(NG(Op)\)—the number of groups, in which at least one mutant exists that was created using the Op mutation operator,

\(NM(Op)\)—the number of all mutants generated using the Op mutation operator.

The second metric, called frequency (FR), examines how frequently an operator occurs. It relates the number of groups that include at least one mutant generated by the operator to the number of all groups.

$$\begin{aligned} \textit{FR}(Op)=\frac{\textit{NG}(Op)}{\textit{NG}_{All}} \end{aligned}$$
(2)

where \(\textit{NG}(Op)\)—as above, and \(\textit{NG}_{All}\)—number of all groups.

The third metric is called dependency (DEP). It evaluates dependency of an ordered pair of mutation operators.

$$\begin{aligned} \textit{DEP}(Op_1,Op_2 )=\frac{\textit{NPM}(Op_1 ,Op_2 )}{\textit{NM}(Op_1 )} \end{aligned}$$
(3)

where

\(\textit{NM}(Op_{1})\) is the number of all mutants generated using the \(Op_{1}\) mutation operator,

\(\textit{NPM}(Op_{1},Op_{2})\) is the number of occurrences of mutant pairs created with the \(Op_{1}\) and \(Op_{2}\) operators. This value is calculated as a sum over all groups of the minimum of two numbers: the number of mutants of a given group created using \(Op_{1}\), and the analogous number of mutants created by the second operator \(Op_{2}\).

$$\begin{aligned} \textit{NPM}(Op_1,Op_2 )=\sum _{g\in G} \min \bigl (\textit{NM}(g,Op_1 ),\textit{NM}(g,Op_2 )\bigr ) \end{aligned}$$
(4)

where

\(\textit{NM}(g,Op_{1})\) is the number of mutants from the group \(g\) that were generated using the \(Op_{1}\) mutation operator.

It should be noted that the operator dependency metric is not symmetric, i.e., in general \(\textit{DEP}(Op_{1}, Op_{2}) \ne \textit{DEP}(Op_{2}, Op_{1})\).
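Two elementary properties follow directly from Eqs. (3) and (4). They are given here as a short derivation, under the assumptions that every mutant belongs to exactly one group and that both operators generated at least one mutant. Since the minimum is symmetric in its arguments, \(\textit{NPM}(Op_1,Op_2)=\textit{NPM}(Op_2,Op_1)\), so the two directions of the metric differ only in the denominator, and each value lies between 0 and 1:

$$\begin{aligned}&0 \le \textit{NPM}(Op_1,Op_2 ) \le \sum _{g\in G} \textit{NM}(g,Op_1 ) = \textit{NM}(Op_1 ) \;\Rightarrow \; 0 \le \textit{DEP}(Op_1,Op_2 ) \le 1\\&\textit{DEP}(Op_1,Op_2 )\cdot \textit{NM}(Op_1 ) = \textit{NPM}(Op_1,Op_2 ) = \textit{DEP}(Op_2,Op_1 )\cdot \textit{NM}(Op_2 ) \end{aligned}$$

Hence the two values coincide only if both operators generated the same number of mutants or if they never share a group (in which case both values are 0).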

3.2 Example

The metrics will be illustrated with a simple example. Three mutation operators were used to generate mutants: EOC, IOP, and EXS. (The full operator names are listed in Table 1.) Five mutants were created using the EOC operator, four mutants were generated with the IOP operator, and one mutant with EXS.

Table 1 Standard and object-oriented mutation operators (C# supported by CREAM v.3)

After the mutant clustering algorithm was applied, four groups of mutants were obtained. The resulting groups consist of the following mutants:

\(\textit{G}1 = \{ \textit{EOC}1, \textit{EOC}2, \textit{IOP}1, \textit{IOP}2 \}\)

\(\textit{G}2 = \{ \textit{EOC}4, \textit{EOC}5, \textit{IOP}4 \}\)

\(\textit{G}3 = \{ \textit{EOC}3 \}\)

\(\textit{G}4 =\{ \textit{IOP}3, \textit{EXS}1 \}\)

The usefulness metric UM was calculated for each operator in the following way:

$$\begin{aligned}&\textit{UM}(\textit{EOC})=\frac{\textit{NG}(\textit{EOC})}{\textit{NM}(\textit{EOC})}=\frac{3}{5}=0.6\nonumber \\&\textit{UM}(\textit{IOP})=\frac{\textit{NG}(\textit{IOP})}{\textit{NM}(\textit{IOP})}=\frac{3}{4}=0.75\\&\textit{UM}(\textit{EXS})=\frac{\textit{NG}(\textit{EXS})}{\textit{NM}(\textit{EXS})}=\frac{1}{1}=1.0\nonumber \end{aligned}$$
(5)

The calculated values can be interpreted as the useful fraction of mutants. For example, it would suffice to select 60 % of the mutants of the EOC operator so that each group containing EOC mutants still includes at least one mutant created by this operator. In contrast, all mutants (100 %) generated by the EXS operator are indispensable to ensure the same condition.

The frequency metric calculated for the example mutants gives the following values:

$$\begin{aligned}&\textit{FR}(\textit{EOC})=\frac{\textit{NG}(\textit{EOC})}{\textit{NG}_{All} }=\frac{3}{4}=0.75\nonumber \\&\textit{FR}(\textit{IOP})=\frac{\textit{NG}(\textit{IOP})}{\textit{NG}_{All} }=\frac{3}{4}=0.75\\&\textit{FR}(\textit{EXS})=\frac{\textit{NG}(\textit{EXS})}{\textit{NG}_{All} }=\frac{1}{4}=0.25\nonumber \end{aligned}$$
(6)

For a given operator, the metric assesses the frequency of mutants belonging to groups. For example, the metric of EOC is equal to 0.75. It means that 75 % of all groups include at least one mutant created using this operator.

Finally, the dependency metric is calculated for any ordered pair of mutation operators. We choose for example two operators EOC and IOP. Because the metric is not symmetric, two ordered pairs are considered: (EOC, IOP) and (IOP, EOC).

$$\begin{aligned}&\textit{DEP}(\textit{EOC},\textit{IOP})=\frac{\textit{NPM}(\textit{EOC},\textit{IOP})}{\textit{NM}(\textit{EOC})}=\frac{3}{5}=0.6\\&\textit{DEP}(\textit{IOP},\textit{EOC})=\frac{\textit{NPM}(\textit{IOP},\textit{EOC})}{\textit{NM}(\textit{IOP})}=\frac{3}{4}=0.75\nonumber \end{aligned}$$
(7)

Based on the first value (0.6), we can deduce that 60 % of the mutants created by EOC can be substituted by IOP mutants. In the opposite direction the value is different and equals 0.75. This means that 75 % of IOP mutants can be paired with an EOC mutant in the same group. Comparing both values of the metric, we can conclude that in this example it is better to substitute operator IOP by EOC (0.75) than vice versa (0.6).
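The calculations of this example can be reproduced with a short Python sketch. Each mutant is represented only by the name of its operator, because the metrics depend on operator counts per group. The function names (`nm`, `ng`, `um`, `fr`, `npm`, `dep`) and the data layout are chosen for illustration and are not part of any tool.

```python
from collections import Counter
from typing import Dict, List

# Groups G1-G4 of the example; each entry is the operator of one mutant.
GROUPS: Dict[str, List[str]] = {
    "G1": ["EOC", "EOC", "IOP", "IOP"],
    "G2": ["EOC", "EOC", "IOP"],
    "G3": ["EOC"],
    "G4": ["IOP", "EXS"],
}

# Per-group operator counts, i.e. NM(g, Op).
COUNTS = {g: Counter(ops) for g, ops in GROUPS.items()}


def nm(op: str) -> int:
    """NM(Op): number of all mutants generated by the operator."""
    return sum(c[op] for c in COUNTS.values())


def ng(op: str) -> int:
    """NG(Op): number of groups containing at least one mutant of the operator."""
    return sum(1 for c in COUNTS.values() if c[op] > 0)


def um(op: str) -> float:
    """Usefulness of mutants, Eq. (1)."""
    return ng(op) / nm(op)


def fr(op: str) -> float:
    """Frequency, Eq. (2)."""
    return ng(op) / len(COUNTS)


def npm(op1: str, op2: str) -> int:
    """Number of mutant pairs, Eq. (4)."""
    return sum(min(c[op1], c[op2]) for c in COUNTS.values())


def dep(op1: str, op2: str) -> float:
    """Dependency of the ordered pair (op1, op2), Eq. (3)."""
    return npm(op1, op2) / nm(op1)


for op in ("EOC", "IOP", "EXS"):
    print(op, "UM =", um(op), "FR =", fr(op))
print("DEP(EOC, IOP) =", dep("EOC", "IOP"))
print("DEP(IOP, EOC) =", dep("IOP", "EOC"))
```

The printed values agree with Eqs. (5)-(7): UM is 0.6, 0.75 and 1.0, FR is 0.75, 0.75 and 0.25, and the two dependency values are 0.6 and 0.75.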

4 Experiments on Mutation Clustering

Evaluation of the approach will be presented on experimental data. The metrics were applied to the analysis of mutation clustering results gathered in the experiments on standard and object-oriented mutation of C# programs [5, 6].

4.1 Experiment Setup

Data for the mutant clustering and evaluation of the metrics were collected in experiments carried out with the CREAM v3 tool. CREAM is a mutation testing tool for C# programs [14]. It was the first tool that supported object-oriented mutation operators for C# programs [15, 16]. Its third version was enhanced with an extension for efficiently performing and evaluating experiments on cost reduction techniques: selection of mutants, mutant sampling, and clustering [5]. The tool supports 18 object-oriented operators and eight standard ones (Table 1).

The experiments were conducted on three commonly used open-source programs: Enterprise Logging (http://entlib.codeplex.com), Castle (http://www.castleproject.org) and Mono-Gendarme (http://www.mono-project.com/Gendarme). All first order mutants were generated for the mutation operators given in Table 1. Only mutants covered by tests from a given test suite were considered, as uncovered mutants could not be killed by the tests. Then all mutants were run against all test cases. The collected results were stored and used in the evaluation of the cost reduction techniques [5]. For each cost reduction method, appropriate quality measures were calculated that express the tradeoff between the mutation score, the number of mutants, and the number of tests.

The detailed results of the basic quality analysis of the mutation clustering approach are presented in [6]. The agglomerative clustering algorithm was applied to all mutants. Mutants generated by standard mutation operators and by object-oriented ones (in short, standard and OO mutants) were analyzed separately. The groups of mutants were formed for the \(K\) degree of the clustering algorithm varying from 0 to 19. According to the quality analysis, the best results were obtained for \(K=1\) in the case of object-oriented operators and for \(K=2\) in the case of standard operators, assuming that the mutation score accuracy contributes 60 % to the overall quality, and the number of mutants and the number of tests 20 % each. The experiments showed that it was possible to use 32 % of OO mutants and 18 % of tests to obtain a mutation score with 97 % accuracy relative to the original one (i.e., calculated using all OO mutants and all tests). The analogous data for the best results of standard mutation were 19 % of mutants, 22 % of tests, and 91 % mutation score accuracy.

4.2 Evaluation of Mutation Clustering Results

Mutation data from the above-mentioned experiments were used in the further evaluation of mutation clustering results addressing the generalization problem. The evaluation was based on the metrics specified in Sect. 3. The results were analyzed separately for standard and object-oriented mutations. The metrics were calculated with respect to all mutation operators used in the experiments.

Table 2 Usefulness of mutants (UM) and frequency (FR) metrics for standard and object-oriented mutation operators

Results of two metrics, usefulness of mutants and frequency, calculated for the subject programs and their average values are shown in Table 2. The upper part of the table includes values of standard operators, whereas the lower part gives data for OO operators. Empty places, denoted by ‘\(-\)’ character, correspond to cases when no mutant was generated from a given program (column) using this kind of operator (row).

Analyzing the first metric for object-oriented operators, we can observe that in most of the cases the value of usefulness of mutants is relatively high. A value can be counted as high if it is bigger than 0.8 for at least one program. Eight OO operators have at least one 1.0 (100 %) value, which means that for this program all mutants generated by this operator contribute as group representatives and could not be omitted in the mutation score analysis.

However, for the PRV operator (Reference Assignment with other Compatible Type) this metric is low, i.e., about 0.3 for each program. By generating only about 32 % of all PRV mutants, it is possible to create the same groups with respect to the operators represented in them. Moreover, analyzing the frequency metric for PRV, we found that on average 9 % of groups include at least one PRV mutant. This value is moderate in comparison with other operators, but not negligible. In conclusion, it is worthwhile to limit the number of PRV mutants, as it is possible to reduce the number of mutants considerably without loss of mutation score accuracy.

Comparing the results of the usefulness metric for object-oriented and standard operators, we can observe that in general the values for standard operators are lower than those for object-oriented ones. Only two standard operators (ABS and LCR) have a high value of the first metric. This confirms other results [5, 11, 17] indicating that there can be more surplus (redundant) mutants among standard mutants than among object-oriented ones. It should be noted that this effect is visible although the set of standard operators of CREAM, and therefore the set used in this experiment, was very limited. It was based mainly on the operators classified as selective in the standard operator analysis [17].

Reasoning analogous to that for the PRV operator can be applied to selected standard operators, in particular ROR and UOR. In the case of these operators, according to the first metric, at least 40 % of mutants should be generated for each operator.

Results of the third metric, dependency, are shown in Table 3 for standard mutation operators and in Table 4 for object-oriented ones. The tables include values averaged over all three programs examined in the experiments. The operators DMC and IHD were omitted, as they generated no mutants in the considered programs.

Analyzing the object-oriented operators, we can see that the maximum values, 0.87 and 0.79, are obtained for the pair of operators IOK and IOD. This result means that most mutants generated using the IOK operator (87 %) are in the same group as mutants created by the IOD operator. The opposite dependency holds for 79 % of the mutants. Therefore, we can assume that omitting one of these operators can reduce the mutation testing cost without considerable loss of mutation score accuracy, because they are complementary, i.e., mutants of one operator can be substituted by mutants of the other. The slightly better choice is to omit IOK, because DEP(IOK, IOD) \(>\) DEP(IOD, IOK).

The dependency metric calculated for other pairs of object-oriented operators gives in most cases very low results (a few percent), or for several pairs results of about 10–20 %. Therefore, we cannot point to any other pair of object-oriented operators as being dependent in general.

Table 3 Dependency metric for standard operators
Table 4 Dependency metric for object-oriented operators

Considering the third metric for standard mutation operators (Table 3), we can find more results above 50 % than for the object-oriented operators. Five standard operators, namely ABS, AOR, ASR, LOR, and UOR, can be partially substituted by the UOI operator. However, the dependency is only in the range of 50–60 % (75 % for ABS alone). Therefore, the dependency is not as definite as in the case of the IOD-IOK pair of operators.

The last column in the tables with dependency metric includes the sum of the numbers in the corresponding row. This sum represents information about to what extent mutants created by the row operator can be substituted by any combination of the remaining operators. The lower the sum, the more reasonable it is to retain this operator in the mutation testing process.
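The use of the row sum can be sketched in Python. The values below are not taken from Tables 3 or 4. They are the dependency values obtained by applying Eqs. (3) and (4) to the groups G1-G4 of the small example in Sect. 3.2, and the names `DEP_TABLE`, `row_sums` and `retention_order` are hypothetical.

```python
from typing import Dict, List

# DEP(row, column) for the toy example of Sect. 3.2 (not experimental data).
DEP_TABLE: Dict[str, Dict[str, float]] = {
    "EOC": {"IOP": 0.60, "EXS": 0.00},
    "IOP": {"EOC": 0.75, "EXS": 0.25},
    "EXS": {"EOC": 0.00, "IOP": 1.00},
}


def row_sums(table: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Sum of the dependency values in each row (the last table column)."""
    return {op: sum(row.values()) for op, row in table.items()}


def retention_order(table: Dict[str, Dict[str, float]]) -> List[str]:
    """Operators ordered from the lowest to the highest row sum;
    ties are broken alphabetically (an arbitrary choice)."""
    sums = row_sums(table)
    return sorted(sums, key=lambda op: (sums[op], op))


print(row_sums(DEP_TABLE))
print(retention_order(DEP_TABLE))
```

In this toy data EOC has the lowest sum, so it would be the first operator to retain, while the mutants of IOP and EXS are more readily covered by mutants of the remaining operators.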

Analyzing this sum for the object-oriented operators, we can see that the highest values are obtained for the operators IOD and IOK (already recognized as complementary) and for the IOP operator. Therefore, the remaining object-oriented operators, i.e., the majority of OO operators, should be retained. This evaluation confirms that OO operators correspond to various advanced programming features and need more specific tests.

For the standard operators the sums are in general higher than for OO operators. Operators LCR, ROR, and UOI turned out to be the most applicable, as they have the lowest sums. This result is consistent with the findings on selective mutation in C# programs [5].

5 Conclusions

This paper presents a study on the evaluation of mutation clustering results. It addresses the question of how the clustering results can be generalized and associated with the selection of mutation operators. The problem was examined with three new metrics describing mutant usefulness, frequency, and dependency in terms of the mutation operators used for mutant generation. The metrics were applied to mutant clustering results for three real-world C# programs mutated with standard and object-oriented operators.

Combining the results of the usefulness and frequency metrics, we can observe that reducing the number of generated PRV mutants gives a noticeable reduction of mutation cost without loss of mutation score accuracy. It is also worthwhile to select only one of the IOK and IOD operators. The lessons learned point to different characteristics of standard (structural) and object-oriented mutation operators. In general, fewer OO mutation operators can be omitted if an adequate mutation result has to be assured. This can be caused by the higher specialization of the OO mutation operators in comparison with the standard ones. For the standard mutation operators, even when starting from a preliminarily reduced set of operators (including all those selective according to [17]), we can still reduce the number of generated mutants. Among the standard operators, the most useful were LCR, ROR, and UOI, which corresponds to the results on selective mutation in OO programs based on other methodologies [5, 11].