1 Introduction

Over time, the world has become increasingly dependent on information technology, and software is now an essential part of everyday life. This is also true for businesses in a wide variety of industries, including airlines, media, security agencies, and education. As a result, many methods and strategies have been developed to increase the reliability and efficiency of software development projects in order to satisfy discerning customers (Alkharabsheh et al. 2018; Felderer and Herrmann 2019; Horváth et al. 2019; Zhao et al. 2019).

Many modern development strategies, such as agile methodologies, have been developed to fulfill increasing demands to deliver high-quality software in a short amount of time with limited resources in an extremely competitive environment (Agren et al. 2018; Anand and Dinakaran 2017; Heck and Zaidman 2018). These strategies address frequent changes and releases, team coordination, customer collaboration, continuous releases, etc. Continuous integration (CI) methodologies and frequent releases are now standard practice in many software organizations where it is essential to increase the speed of delivering new products and features (Haghighatkhah et al. 2018; Horváth et al. 2019).

However, the changes introduced in each release or iteration to support new or existing features significantly increase the number of bugs in the code (Kandil et al. 2015, 2016; Knauss et al. 2015; Thangiah and Basri 2016). This forces the team to prioritize fixing bugs over regression testing, which compromises the quality of the released software. Reducing the regression scope is straightforward when there are only a few user stories, but retesting all test cases after each change quickly becomes unworkable as the number of user stories grows (Elbaum et al. 2014; Knauss et al. 2015; Rosero et al. 2017).

Agile practices rely on regression testing (RT), which requires a significant amount of effort and resources during implementation (Hettiarachchi et al. 2016; Thangiah and Basri 2016). Agile approaches utilize recursive implementations and test sequences to reduce the interval between tests (Huang et al. 2012; Kandil et al. 2014). RT verifies the quality of software after modifications have been made during development: it is a maintenance activity executed to provide assurance that variations and/or modifications have not adversely affected existing functionality (Ansari et al. 2016; Spieker et al. 2017). Because large test suites require considerable time and expense to execute, various techniques have been proposed to reduce the time and cost of execution, such as test case prioritization (TCP), regression test selection (RTS), test suite minimization (TSM), and test suite augmentation (TSA) (Do 2016; Spieker et al. 2017). RT reuses previous test suites, together with any new test cases added to validate new or modified features (Anderson et al. 2014; Do 2016). TSM prunes outdated or obsolete test cases, permanently removing them from the test suite (Kandil et al. 2015; Miranda and Bertolino 2017), and RTS selects a subset of the test suite with which to validate modifications, focusing fault detection on the changed parts of the software (Al-Hajjaji et al. 2019; Ansari et al. 2016; Wang et al. 2019). TSA identifies newly added code and creates new tests to validate the functions of the revised system (Kandil et al. 2016; Miranda and Bertolino 2017; Panichella et al. 2015). TCP orders the complete test suite to facilitate earlier fault detection within a limited time and cost, based on criteria such as code coverage, faults, historical information, and requirements (Al-Hajjaji et al. 2019; Azizi and Do 2018; Chen et al. 2018; Flemström et al. 2018; Kandil et al. 2014; Panichella et al. 2015; Rosero et al. 2016; Rosero et al. 2017; Wang et al. 2019).

A significant amount of research has been conducted on how to efficiently prioritize and select test cases to increase the fault detection rate. However, existing RT techniques in the agile environment lack the ability to effectively prioritize and select test cases: executing irrelevant and redundant test cases leads to the detection of redundant faults and to repeated re-execution of the test suite.

Therefore, to address these issues, the contributions of our research are as follows:

  • We present a test case prioritization and selection approach for agile strategies in a CI context, based on the criteria of frequently changed and failed test cases, to increase the fault detection rate.

  • The proposed model offers test case prioritization and selection using frequently changed test cases and failed test cases produced by CI at every release in agile development strategies. We divide our proposed model, called CTFF (prioritize and select frequently changed test cases and failed frequency), into two phases. The first phase clusters frequently changed test cases into multiple clusters; prioritization parameters are then generated based on the highest failure frequency, and if more than one test case has a similar frequency, test code coverage is used as the second priority criterion. In the second phase, the test cases with the highest priority are selected from each cluster for execution to identify the maximum number of faults.

  • The proposed approach thereby addresses a drawback of RT in CI under modern development strategies. The CTFF model was evaluated using three different software systems with different test suite sizes, developed in an agile environment. The analysis of the results reveals that the CTFF model detects more faults and is more effective in all cases than random prioritization and other fault-based methods.

  • The study provides a roadmap, baseline, and empirical evidence for future research in the domain of RT for continuous integration.

The remainder of this paper is organized as follows. Section 2 describes the related work, and Section 3 presents the methodology in which the CTFF model is introduced. The details of the empirical study are presented in Section 4. Results and discussion are presented in Section 5, and the conclusions drawn from this research, together with directions for future work, are presented in Section 6.

2 Related work

This section summarizes the existing literature on regression test prioritization, covering the diverse approaches researchers have used to improve TCP and RTS techniques for regression testing under agile methods.

Studies in the literature on agile RT have suggested that the optimum points for applying regression testing techniques in agile environments are the sprint and release levels (Anita and Chauhan 2014; Haghighatkhah et al. 2018; Kandil et al. 2016; Thangiah and Basri 2016). At the sprint level, Kandil et al. (2016) proposed a weighted sprint TCP technique that orders test cases on the basis of three parameters, while at the release level they proposed a cluster-based release TCS technique that groups user stories according to the similarities between the modules, with test cases selected based on features identified from the faults in failed test cases via text mining techniques. However, no prioritization technique was proposed at the release level to determine the sequence in which the selected test cases must be completed to satisfy the testing objective, and no selection technique was proposed at the sprint level to assemble an optimal regression test suite from the entire test suite. The benefit of selection at the sprint level is that it reduces the number of test cases, which in turn reduces the time and effort expended.

Kandil et al. (2015) reduced the number of test cases by evaluating the number of user stories with similar issues that are covered by different test cases. The remaining test cases were then prioritized using weighted agile parameters to increase the detection rate of faults. Knauss et al. (2015) found that executing continuous integration activities in a large software development project is difficult due to structural, social, and practical differences between organizations. It is therefore essential that the test cases be prioritized so that faults can be detected quickly, and bugs can be identified.

Some of the other common regression testing TCP techniques developed by various authors are as follows. Coverage-based TCP uses coverage information to reorder the test cases in order to identify the maximum number of faults (Horváth et al. 2019; Miranda and Bertolino 2018; Spieker et al. 2017). Coverage-based techniques use prioritization criteria such as statement, branch, or path coverage (Gupta et al. 2015; Horváth et al. 2019). History-based techniques employ historical data to prioritize the test cases in future sessions, which requires historical information about the test cases to be maintained, such as the execution history of the test cases, the fault detection rate (Abu Hasan et al. 2017; Al-Hajjaji et al. 2019; Aman et al. 2018; Azizi and Do 2018), and other factors (Azizi and Do 2018; Haghighatkhah et al. 2018; Miranda and Bertolino 2018).

Other types of TCP for regression testing include human and probability-based techniques. Lin et al. (2013) improved prioritization in the current version by referring to the results of earlier versions. The experimental results identified similar types of test cases and showed that the proposed approach outperformed existing approaches; it also improved software quality, although it ignored test changes and continuously created test cases to cover new features. The conclusion from the above-referenced studies is that methods are needed to more accurately prioritize test cases based on historical information to improve long-term software performance.

Wang and Zeng (2014) employed a prioritization model to add flexibility to the test case prioritization process based on a multi-dimensional equation for sorting test cases for earlier execution. They also found that the performance of dynamic test case prioritization strategies could be improved by considering weighted probability distributions. However, a limitation is that this technique ignores test changes and historical fault information and fails to maintain a repository for future regression tests. Historical information combined with code coverage is more effective in terms of fault detection than approaches in which only test cases covering changed lines are executed (Gupta et al. 2015). The use of code-coverage information in prioritization techniques improves maximum fault detection but overlooks the fault detection rate and does not consider the requirements of regression testing.

Miranda and Bertolino (2017) proposed a hybrid approach for software reuse that improved the fault detection rate and reduced the size of the test suite. However, they did not consider coverage information pertaining to the reuse of test cases in regression testing. Silva et al. (2016) presented a hybrid approach for TCP and RTS based on the relationships between system components, which was found to solve problems effectively with high quality. However, it did not maintain a repository for historical information and did not consider the fault information of test cases when performing regression testing on new versions of the software. Consequently, a large number of test cases were generated, which degraded the fault detection ability.

The prioritization approach of Wang and Zeng (2016) used historical data and prioritized requirements to determine the initial test case priority. Even though the performance of this approach was acceptable, redundant faults were detected when faults were mapped to requirement properties. Coverage information and changes in the test cases were not used for regression testing, and while multiple criteria were used for prioritization, the frequency of test changes was ignored. Abu Hasan et al. (2017) presented a dissimilarity-clustering TCP approach that uses historical information to identify the maximum number of faults in less time.

Elbaum et al. (2014) proposed an algorithm that increased the effectiveness of continuous integration. The initial pre-submit phase applied RTS techniques to select test suites for the specific modules to be tested. In the post-submit phase, dependent and changed modules were tested, and prioritization techniques were utilized to sequence test cases to increase the likelihood of earlier fault detection. Both phases employed novel techniques and proposed less expensive algorithms. However, the proposed selection and prioritization techniques did not account for the capability and variability of the computing infrastructure and only considered a limited dataset. The problem with that approach is that different datasets are necessary in order to properly understand the RTS limitations when the size of the code and the number of changes increase. Other factors, such as fault severity and the importance of the user stories, are also significant in agile environments.

Kandil et al. (2014) proposed an RTS approach that analyzed the historical relationship between test case failures and code changes when determining the optimal test suite. However, this approach employed a historical RTS technique that required the maintenance of repositories to store the historical data, which comes at a high cost, and this technique was not automated. To date, this approach has only been used in two scenarios requiring a restructuring of the test suite and rearranging of the effort across the test scope.

Anita and Chauhan (2014) proposed a method of test selection that used a weighted undirected graph of user stories based on the average path length and value constraints. Test cases were selected based on their relevance to the specific source and destination user stories. The advantage of this technique was that it selected an optimal set of user stories to ensure high levels of quality and action. However, it was not automated, and test cases were selected based only on optimized user stories, without any prioritization of the test cases to determine the order of execution. This is problematic, as other factors should also be considered when selecting test cases in an agile environment.

Azizi and Do (2018) proposed a TCP-based collaborative filtering recommender system that uses change history information in a dynamic environment for the decision-making process. They observed that multiple criteria can improve the effectiveness of TCP and that the fault detection rate needs to be increased intelligently as new items are added. Haghighatkhah et al. (2018) proposed RT for fault detection in a continuous integration environment: the availability of failure history data is an important criterion but improves effectiveness only to a certain extent, while history-based diversity is more effective but has the disadvantage of high execution time. Ouriques et al. (2018) compared different existing TCP techniques in the context of model-based testing in a replicated study to investigate the influence of test case size on the fault detection rate.

Al-Hajjaji et al. (2019) proposed a similarity-based TCP technique for product lines with diverse feature interaction coverage; the study analyzed its effectiveness in both real and seeded fault detection after evaluation on three applications with distinctive feature sizes. Horváth et al. (2019) investigated the impact of code coverage-based Java language tools on TCP and TSM, and found that coverage information is useful for highlighting the number of code lines covered by each test case for optimization during RT. Wang et al. (2019) proposed a location-based TCP technique for embedded systems using the law of gravitation to achieve high reliability after modification. Shin et al. (2018) defined a multi-objective TCP method for uncertainty prediction in cyber-physical systems.

Other existing studies leveraged the TCP and RTS techniques to identify the maximum number of faults as soon as possible with a reduced number of test cases (Azizi and Do 2018; Flemström et al. 2018; Haghighatkhah et al. 2018; Noor and Hemmati 2015; Spieker et al. 2017; Wang and Zeng 2016). However, most of these prioritization techniques only considered code coverage criteria and ignored information about tests that frequently changed, which limited their ability to reduce the number of test cases and identify the maximum number of faults (Azizi and Do 2018; Fischer et al. 2018; Flemström et al. 2018; Lu et al. 2016; Ma et al. 2019; Noor and Hemmati 2015). Another limitation is that the code coverage criterion alone is not sufficient to identify the maximum number of faults; multiple criteria are necessary for optimal fault detection (Lachmann et al. 2015; Ouni et al. 2017; Shin et al. 2018).

TCP techniques have also been used along with other criteria to sort test cases for maximum fault detection based on changes in the coverage information, fault rate, and historical information (Aman et al. 2018; del Sagrado and del Águila 2018; Horváth et al. 2019; Magalhães et al. 2017; Mahali and Mohapatra 2018), and a few studies also considered test changes (Almasri et al. 2017; Azizi and Do 2018; Chen et al. 2018; Lu et al. 2016). However, there is still a need to improve TCP techniques in order to locate latent bugs (Alkharabsheh et al. 2018; Azizi and Do 2018; Felderer and Herrmann 2019; Miranda and Bertolino 2017; Özdağoğlu and Kavuncubaşı 2019; Wang et al. 2019). It would also be extremely beneficial to identify a technique that is able to detect the maximum number of faults in the shortest possible time.

A summary of the existing techniques, the factors they address, and their limitations (e.g., low fault detection ability, reliance on code criteria alone, lack of continuous integration support) is provided in Table 1. From the literature, we identified that the main reason for proposing different regression testing techniques is to increase fault detection ability while reducing redundant faults and irrelevant test cases, using criteria such as coverage and historical information. Nevertheless, these studies still fall short of improving the rate of fault detection with respect to several factors: fault detection ability (FDA), code coverage (CC), test changes (TC), fault rate (FR), multi-criteria (MC), irrelevant test cases (ITC), and continuous integration (CI), as described in Table 1 together with the reasons given in the existing literature. These factors appear in multiple existing studies, are relevant to the scope and context of our work, and provide the basis for the comparison of results.

Table 1 Limitations in existing techniques

As explained above, most existing RT techniques are not intended for use in agile development, and there is room for improvement in those that do address RT for modern development strategies to provide maximum and earlier fault identification. In this research, a test case prioritization and selection model is proposed for agile methodologies in a CI context that identifies frequently changed and failed test cases in order to increase the fault detection rate. This model performs test case prioritization and selection by clustering the frequently changed test cases produced by CI at every release.

The proposed model, called the CTFF model, prioritizes and selects frequently changed test cases and the corresponding failure frequencies. The process consists of two phases. The first phase clusters redundant frequently changed test cases into multiple clusters and then generates prioritization parameters based on the highest failure frequency. If multiple test cases have similar failure frequencies, then test code coverage is used as the second priority criterion. In the second phase, test cases are selected for execution from the individual clusters to identify the maximum number of faults. In this way, the proposed model addresses a key limitation of RT in CI.

Thus, a key objective of the CTFF model is to enhance the RT process for agile applications by increasing the fault detection rate while excluding irrelevant and/or redundant tests and considering the test cases that frequently change and/or fail.

3 Methodology

In this section, we present the proposed model and describe its internal processes and operation. The CTFF model applies hybrid regression testing techniques to mitigate the shortcomings of existing techniques, i.e., irrelevant test case selection and redundant faults. The test case change history is used for clustering to reduce the number of test cases; frequently failed test cases are used to prioritize test cases, with code coverage removing ambiguity in the case of equal priority; and the highest-priority test cases are then selected from each cluster to identify the maximum number of faults as quickly as possible.

3.1 Proposed CTFF model

A workflow of the proposed CTFF model, including its major phases, is shown in Fig. 1. The motivation behind building such a model is to improve the implementation of RT in agile development scenarios to ensure software quality at the end of every feature or product release. The model is a hybrid of TCP and RTS and considers test case coverage instead of code coverage criteria during prioritization to select the most frequently changed and failed test cases. To better accommodate large numbers of test cases, the cases are clustered according to their change frequency. If two clusters have similar change frequencies, they are prioritized based on their failure frequency and coverage criteria. The underlying premise is that changes to test cases increase the likelihood of faults, which in turn affect the functionality of the software.

Fig. 1 CTFF model

In the following, the main phases of the CTFF model are briefly described; a code sketch tying the phases together is given after the list:

  1. The first phase is test case extraction. This phase consists of the following stages.

  • User stories—In the agile environment, user stories are used to define the functionality that should be provided by the system. A user story explains what the user expects from the system and represents the business value that an agile cross-functional team should add to the product in a sprint. In this step, user stories are stored in a file for the selected sprint. Each user story can have multiple related test cases.

  • Test suite for current release/iteration—The user stories associated with the release are captured in order to extract all associated test cases.

  2. The test case clustering process includes the following steps.

  • Identify frequently changed test cases—Identify test cases that frequently change across different releases of the software's user stories. The change frequency is computed as the average number of changes to a test case across the modified versions of the software, as follows:

    $$ {T}_f=\frac{\sum {T}_c}{n} $$
    (1)

    where Tf is the change frequency of the test cases, Tc is the total number of test case changes in each release, and n is the number of times the software was modified.

  • Cluster—In the cluster module, test cases are clustered into groups with similar change frequencies; in other words, test cases with similar change frequencies are grouped in the same cluster. The CTFF algorithm employs a semi-supervised clustering technique, semi-supervised K-means, which combines semi-supervised nonlinear dimensionality reduction with K-means (Abu Hasan et al. 2017; Arafeen and Do 2013; Gultepe and Makrehchi 2018; Ni et al. 2017). In the K-means clustering process, the total number of clusters depends on k, the number of groups into which the tests are clustered. This process is illustrated in Fig. 2; it starts by extracting the data and uses the value of k to cluster the data based on similarity. At the end, the large unstructured data set is divided into clusters, reducing complexity and ambiguity.

    The distance between a test case's change frequency and the centroid of a cluster is the Euclidean distance (ED) given in Eq. (2) (Arafeen and Do 2013; Gultepe and Makrehchi 2018; Ni et al. 2017), where dcf is the ED between test cases tc and tf, m is the number of instances, and X is the instance feature value. Clustering was performed using the SPSS tool.

    $$ {d}_{cf}=\sqrt{\sum_{k=1}^{m}{\left({X}_{ck}-{X}_{fk}\right)}^2} $$
    (2)
  3. After the test cases have been extracted and clustered, prioritization and selection are performed.

  • Generate parameters—First, metrics are generated for use in assigning priorities, the most important of which is the failure frequency of the test cases. The error descriptions captured from previous test executions are used to prioritize the tested system functionality as follows.

    $$ {F}_p=\frac{\left(\sum {TF}_i\right)}{n} $$
    (3)

    where TFi is the total number of times a test case failed, and n is the total number of failed test cases. Code coverage criteria are used for prioritization if multiple clusters have similar failure frequency percentages; in this case, the percentage of coverage per test case is used, defined as the total number of code lines covered by each test case.

  4. Prioritize test cases—In this step, the test clusters are ordered and prioritized to ensure that the most important cases are selected first.

  5. Selection—Finally, the test suite is formalized to include the test cases selected from all of the clusters as having the highest failure frequency or highest coverage.

Fig. 2 Clustering of test cases
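To make the workflow above concrete, the following Python sketch strings the phases together on invented data. It is a minimal illustration, not the published implementation: the record fields and the toy suite are our own, plain K-means from scikit-learn stands in for the semi-supervised K-means the paper runs in SPSS, and the normalization used for the failure frequency is an assumed reading of Eq. (3).

from collections import defaultdict
import numpy as np
from sklearn.cluster import KMeans

N_RELEASES = 6  # n in Eq. (1)

# Hypothetical per-test-case history; the fields are assumptions, not iTrust data.
tests = [
    {"id": "TC-1", "changes": 4, "failures": 3, "coverage": 120},
    {"id": "TC-2", "changes": 1, "failures": 0, "coverage": 300},
    {"id": "TC-3", "changes": 5, "failures": 4, "coverage": 80},
    {"id": "TC-4", "changes": 4, "failures": 3, "coverage": 200},
    {"id": "TC-5", "changes": 2, "failures": 1, "coverage": 150},
    {"id": "TC-6", "changes": 0, "failures": 0, "coverage": 90},
]

# Phase 2, Eq. (1): change frequency Tf = (sum of changes) / n.
for t in tests:
    t["change_freq"] = t["changes"] / N_RELEASES

# Phase 2: group TCs with similar change frequencies (k = 5 in the paper;
# k = 3 here only because the toy suite is small).
X = np.array([[t["change_freq"]] for t in tests])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
clusters = defaultdict(list)
for t, label in zip(tests, labels):
    clusters[label].append(t)

# Phase 3, Eq. (3): failure frequency as the first priority criterion and
# code coverage as the tie-breaker; select the top TC of each cluster.
def priority(t):
    failed_freq = t["failures"] / max(t["changes"], 1)  # assumed form of Fp
    return (failed_freq, t["coverage"])

selected = [max(members, key=priority) for members in clusters.values()]
print("regression suite:", sorted(t["id"] for t in selected))

On this toy suite the sketch yields one highest-priority TC per cluster, mirroring the selection phase described above.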

4 Empirical study

In this section, we investigate the effectiveness of the CTFF model described in Section 3 by means of a practical assessment focused on the following three research questions (RQs):

RQ1: Is the CTFF model able to increase the fault detection rate during RT by clustering frequently changed test cases and using previous fault information?

In this RQ, we analyze whether the CTFF model increases the number of faults exposed, since most existing studies in the literature neither clustered frequently changed test cases nor initialized the RT process based on test case changes.

RQ2: Is the proposed CTFF technique more effective and efficient than random prioritization and fault-based techniques?

This RQ investigates the effectiveness of CTFF in terms of the Average Percentage of Faults Detected (APFD) and the F-measure for test suite prioritization and selection, compared with existing random prioritization and fault-based procedures.

RQ3: Is the performance of the CTFF technique affected by the size of the set of test cases related to changes?

The objective of RQ3 is to investigate the impact of the selected test suite size on the fault detection efficiency of CTFF compared with other techniques, and its effect on reducing irrelevancy in TC selection and redundant fault detection during prioritization.

4.1 Industrial systems

Three cases were selected for the performance evaluation of the CTFF model, chosen for their differences in scope, domain, and background. One of the cases is an open-source system, iTrust; the other cases are real-world industrial applications. For confidentiality reasons, links to the involved companies are omitted and their names are replaced by alphabetical labels.

Case A (CA) is iTrust, a patient-centric electronic health record system developed by the Realsearch research group at North Carolina State University to let patients electronically record health-related information (Arafeen and Do 2013; Hettiarachchi et al. 2016). This open-source system was used to estimate the performance of the CTFF method. For the corresponding test suite and test case change information, the developers of iTrust created appropriate test cases (TC) and generated other artifacts, such as a traceability matrix, requirement specification, and requirement-modification history.

Case B (CB) is a web-based IT services application describing a large information system (obtained from GR Solutions Private Ltd., Islamabad). The application is written in Java and maintains technical information relevant to construction-based projects. The company providing Case C (CC) supplied a customer relationship management (CRM) system (Azizi and Do 2018), also written in Java, which provides users a customized environment to create their own shapes and diagrams without writing code. Table 2 shows the prioritization criteria for the representative applications. The data contains already verified use cases in which no real faults remained, so mutant faults were used.

Table 2 Prioritization criteria

Test case change history data is also required; we derived this data from the version repository of each application. The performance test of the CTFF model is based on three releases of each application. A flow chart of this empirical study is shown in Fig. 3; it starts by extracting user stories and recording requirement tasks, then extracts TCs for clustering, and ends after the selection of an optimal set of TCs to execute from the large bundle of TCs. Two other techniques, random prioritization (RP) and fault-based (FB), were used for the effectiveness comparison, as sketched below. RP prioritizes test cases by simply assigning a random priority to each test case for execution across all application test cases. FB uses historical information about the test cases, i.e., the previous failure rate, for prioritization and selection.
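For reference, the two baselines reduce to a few lines of Python; the suite below and its past_failure_rate field are hypothetical stand-ins for the applications' real test data.

import random

suite = [{"id": f"TC-{i}", "past_failure_rate": r}
         for i, r in enumerate([0.1, 0.7, 0.0, 0.4, 0.9], start=1)]

def random_prioritization(tests, seed=None):
    # RP: a uniformly random execution order.
    order = list(tests)
    random.Random(seed).shuffle(order)
    return order

def fault_based(tests):
    # FB: order by historical failure rate, highest first.
    return sorted(tests, key=lambda t: t["past_failure_rate"], reverse=True)

print([t["id"] for t in random_prioritization(suite, seed=0)])
print([t["id"] for t in fault_based(suite)])  # TC-5, TC-2, TC-4, TC-1, TC-3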

Fig. 3 Case study flow chart

The requirement specification extracted after conversion into user stories or task cards is shown in Fig. 4, which depicts a list of user stories for every release and for the complete system. A snapshot of the test cases extracted based on the user stories or task cards for the current release, with complete execution steps, is shown in Table 3. The test file contains the test cases for each user story, with the steps for test case execution and verification at the end of every release.

Fig. 4 User stories file

Table 3 Test cases file

Note that the test cases were extracted for every release of the system. Then, the average number of changes to each test case was computed using Eq. (1). For example, TC-1 changed 4 times over a total of 6 releases or modifications, giving a change frequency of 4/6 ≈ 0.66. Test cases with similar change frequencies were placed in a single cluster, and test cases with different change frequencies were placed in different clusters. Therefore, for CA, five clusters were extracted from the 143 TCs in one release.

Test cases with similar change frequencies were clustered using the SPSS 19 tool. Figure 5 depicts the number of clusters and the number of cases in each cluster for two releases of CA, while Fig. 6 illustrates the cluster centroids, showing that the least frequently changed TCs were placed in cluster C1 and the most frequently changed TCs in C5. Parameters were then generated, with frequently failed tests as the first priority criterion and test coverage as the second. For the iTrust application, Fig. 7 depicts the hierarchy of the clusters after classifying the test cases by number of changes, where the x-axis indicates the test cases and the y-axis the clusters. Figure 7 shows that five clusters were created on the basis of the similarity of the TCs' change frequencies. For example, release 1 included 143 test cases; the SPSS tool (computing Eq. (2)) identified the clusters of TCs, with cluster C1 containing TC-64, TC-57, …, and likewise for all clusters based on change frequency.

Fig. 5 Clusters

Fig. 6 Cluster centroids

Fig. 7 Hierarchical view of clusters

4.2 Mutant faults

Mutant faults were introduced into the iTrust code to properly validate fault detection. Mutation testing assesses the completeness of a test suite by implanting seeded errors into the application; it determines whether the test suite can distinguish these changes at a significantly lower cost than hand-seeded faults. Different mutation operators are utilized, such as Arithmetic Operator Change (e.g., the addition (+) operator is replaced with (−), (*), or (/)), Logical Connector Change (e.g., an AND connector is replaced with an OR or XOR connector), Relational Operator Change (e.g., the (>=) operator is replaced with (<=), (==), or (!=)), Access Flag Change (e.g., a private access flag is changed to a public access flag), Overriding Variable Deletion (deletes a declaration of overriding variables), Overriding Variable Insertion (inserts variables from a parent class into the child class), Overriding Method Deletion (deletes a declaration of an overriding method in a subclass so that the overridden method is referenced), and Argument Order Change (changes the order of arguments in a method invocation, if there is more than one argument) (Arafeen and Do 2013; Hettiarachchi et al. 2016).
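As a toy illustration (the study seeded mutants into iTrust's Java code; this Python analogue is ours), a Relational Operator Change mutant and the boundary test that kills it might look as follows.

def is_adult(age):
    return age >= 18  # original predicate

def is_adult_mutant(age):
    return age > 18   # mutant: (>=) replaced with (>)

# A test on the boundary value distinguishes, i.e., kills, the mutant ...
assert is_adult(18) is True
assert is_adult_mutant(18) is False  # the mutant is detected here

# ... while a test away from the boundary lets it survive:
assert is_adult(30) == is_adult_mutant(30)

A test suite that kills a large share of such mutants is, by this reasoning, also likely to catch comparable real faults.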

In actual testing scenarios, programs do not typically contain as many faults as these mutant pools suggest. We therefore introduced mutant faults formed by randomly selecting mutants from the pools created for each version to measure the effectiveness of the CTFF model.

4.3 Evaluation metrics

The following metrics are used in the evaluation of the proposed approach.

4.3.1 APFD

This metric is used to compute the fault detection rate over the entire prioritized test suite (Azizi and Do 2018; Fischer et al. 2018; Li et al. 2018; Magalhães et al. 2017). The higher the value of the fault detection rate, the earlier the maximum number of faults will be detected during regression testing. This metric can be computed as follows.

$$ \mathrm{APFD}=1-\frac{{TF}_1+{TF}_2+\dots +{TF}_m}{mn}+\frac{1}{2n} $$
(4)

where TFi is the position, in the execution order, of the first test case that reveals fault i, m is the total number of faults identified in the program, and n is the number of test cases.
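A direct transcription of Eq. (4) into Python may help; the example suite size and fault positions below are invented.

def apfd(fault_positions, n):
    # Eq. (4): fault_positions[i] is the 1-based position, in the
    # prioritized order, of the first test case that reveals fault i.
    m = len(fault_positions)
    return 1 - sum(fault_positions) / (m * n) + 1 / (2 * n)

# Hypothetical run: 5 test cases; 4 faults first revealed by the tests
# at positions 1, 1, 2, and 3 of the prioritized suite.
print(round(apfd([1, 1, 2, 3], n=5), 3))  # 0.75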

4.3.2 Precision (P)

The P measure indicates the accuracy with which test cases were selected to be rerun (Kandil et al. 2016; Ni et al. 2017; Rosero et al. 2017), and can be computed as follows.

$$ P=\frac{\left|{T}_f^{\prime}\right|}{\left|{T}_f^{\prime}\right|+\left|{T}_r^{\prime}\right|} $$
(5)

where T′f represents the set of selected test cases that revealed faults, and T′r represents the set of selected test cases that did not reveal faults.

4.3.3 Recall (R)

The R measure indicates the percentage of selected test cases relative to all failed test cases (Kandil et al. 2016; Ni et al. 2017; Rosero et al. 2017) and can be computed as follows.

$$ R=\frac{\left|{T}_f^{\prime}\right|}{\left|{T}_f^{\prime}\right|+\left|{T}_n^{\prime}\right|} $$
(6)

where T′n represents the set of test cases that were not selected but failed. A value close to one indicates a high level of accuracy.

4.3.4 F-measure (F)

This measure is a combination of both P and R, which indicates the overall efficiency of the optimal test cases selection process (Kandil et al. 2016; Ni et al. 2017; Rosero et al. 2017). It can be computed as follows.

$$ F=\frac{\left(2\times P\times R\right)}{\left(P+R\right)} $$
(7)
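Eqs. (5) to (7) translate directly into code; the counts below are hypothetical.

def precision(tf, tr):
    # Eq. (5): tf = selected tests that revealed faults,
    # tr = selected tests that revealed none.
    return tf / (tf + tr)

def recall(tf, tn):
    # Eq. (6): tn = failing tests that the selection missed.
    return tf / (tf + tn)

def f_measure(p, r):
    # Eq. (7): harmonic mean of precision and recall.
    return (2 * p * r) / (p + r)

# Hypothetical counts: 18 selected tests failed, 2 selected tests passed,
# and 1 failing test was missed by the selection.
p, r = precision(18, 2), recall(18, 1)
print(round(p, 2), round(r, 2), round(f_measure(p, r), 2))  # 0.9 0.95 0.92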

5 Results and discussion

The results of the practical assessment with respect to the RQs are as follows.

RQ1: The results of the evaluation confirmed that the approach adopted in the CTFF model of clustering frequently changed test cases during process initialization significantly increased the fault detection rate. When the CTFF procedure was applied to the CA, CB, and CC datasets, test cases with high priority were identified. Identical scenarios were used to assess the performance of the random and fault-based ordering techniques for comparison. In the CTFF technique, after applying clustering, we extracted clusters of similar change frequency with different TC sizes in all three cases. We then calculated the failure frequency by dividing the total number of times a TC failed by the total number of changes (using Eq. (3)). In the next step, after selecting the Frequently Failed Test Cases (FFTC) from each cluster for CA, CB, and CC, we prioritized the TCs in descending order of Failed Test Case Frequency (FTCF). In the case of a tie in FTCF, Test Case Coverage (TCC) was used as the second priority criterion to break the tie. The priority and frequency criteria of the selected TCs for all techniques are listed in Table 4.

Table 4 Prioritization criteria

The orders of TCs used by the various techniques are listed in Table 5 for all selected cases; the techniques chose different TCs for execution to detect the maximum number of faults in the first execution. The faults detected by each method in the three cases are compared in Fig. 8, where the x-axis indicates the TCs executed and the y-axis the number of faults detected for each test case.

Table 5 Prioritized test suites
Fig. 8 Comparison of the techniques

The results show that the CTFF technique detected almost 100%, 98%, and 99% of the faults in CA, CB, and CC, respectively, after the first execution of the TCs, whereas on the first run the other methods identified only 20%, 30%, and 40% of the faults across the three cases (fault-based) and 40%, 20%, and 50% (random prioritization), respectively. The box plots in Fig. 8 show that most faults were detected earlier in all cases, i.e., CA, CB, and CC, using the CTFF technique than with RP and FB, which required many executions to detect faults, increasing the time and cost of development.

RQ2: We investigated the effectiveness of CTFF in comparison with the other methods for all cases using the APFD metric of Eq. (4). As shown in Fig. 9, the APFD of the CTFF model is higher than those of the existing methods (i.e., RP and FB). In all three cases, CTFF has the higher APFD value, which means that CTFF's prioritization criteria are a key component and give it a higher ability to identify faults than the other methods.

Fig. 9 APFD of all techniques

The results of the performance analysis are shown in Table 6 and Fig. 10, where it can be seen that the CTFF approach successfully identified an optimal set of test cases and increased the fault detection capability. The precision, recall, and F-measure values, calculated using Eqs. (5), (6), and (7), respectively, were used for the CTFF efficiency analysis. The results indicate that the test cases selected using CTFF have higher efficiency (F-measures of 0.95, 0.92, and 0.96 for CA, CB, and CC, respectively) than those of random prioritization (0.56, 0.69, and 0.52) and the fault-based method (0.68, 0.67, and 0.67). Figure 10 a, b, and c for CA, CB, and CC, respectively, show that CTFF is more efficient than the other methods. Consequently, the results establish that the test suite selected by CTFF for CA, CB, and CC is an optimal set of TCs with a higher degree of fault detection, whereas the RP and FB techniques demonstrate a lower degree of fault detection ability.

Table 6 Evaluation metric analysis
Fig. 10 Performance analysis for CA, CB, and CC

RQ3: Here, we evaluated the impact of the size of the failed test case selection on irrelevancy and redundant faults using CTFF in all three cases, and compared the results with the other techniques to provide further evidence about the sensitivity of size variation to irrelevancy and redundancy. The size variation of the executed test suite was not the same for each technique in all cases; CTFF is less sensitive to test suite size for prioritization and selection than RP and FB. To measure test suite size variation, we divided the number of failed TCs that were selected and prioritized by the total number of failed TCs; for code coverage, we divided the number of lines of code covered by the failed, selected TCs by the total number of lines of code covered by all TCs (a sketch of both ratios follows). The results, shown in Fig. 11, demonstrate that the CTFF technique has no issue with irrelevant TC selection or redundant faults after prioritization; the y-axis depicts the relative percentage of TCs with respect to size variation for the three techniques on the x-axis. In contrast, the results for FB and RP indicate that these techniques are more strongly affected, selecting irrelevant TCs for execution and detecting redundant faults after TC execution.
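A minimal sketch of the two ratios, with variable names of our own choosing:

def failed_selection_ratio(failed_selected, failed_total):
    # Share of all failing TCs that the technique selected and prioritized.
    return failed_selected / failed_total

def coverage_ratio(loc_failed_selected, loc_total):
    # Share of covered code lines exercised by the failed, selected TCs.
    return loc_failed_selected / loc_total

# Hypothetical numbers for one release:
print(failed_selection_ratio(19, 20))  # 0.95
print(coverage_ratio(850, 1000))       # 0.85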

Fig. 11 Variation analysis

5.1 Threats to validity

As is typical when evaluating case studies, several threats arise that may challenge the validity of the results, necessitating replication of the research to confirm or refute the conclusions. The core threats to validity (TV) can be categorized as follows (Felderer and Herrmann 2019; Miranda and Bertolino 2018; Ouriques et al. 2018):

  • Internal TV relates to factors concerning the arrangement of requirements. To address this threat, mitigation steps must be adopted to avoid using inconsistent criteria for ordering and selection. We therefore used test case change frequency and failure frequency to achieve a higher fault detection rate and to reduce irrelevancy in test selection and redundancy in fault detection, respectively. The experimental results showed that CTFF improves fault detection ability and detects the maximum number of faults as early as possible.

  • Construct TV considers the connections between the various concepts and their reflections. This requires the use of evaluation metrics to assess the validity of the diverse practices in the CTFF model. We therefore used the APFD metric for prioritization and the F-measure for selection to estimate the effectiveness of CTFF compared with the other techniques.

  • Conclusion TV relates to the associations between action and consequence. This can be mitigated via a rigorous practical assessment of the various decisions employed in CTFF validation; a case study can then be used to defend the decisions via a qualitative analysis to reduce bias. All authors participated in the evaluation, in data collection, and in cross-checking the analysis of the results.

  • External TV relates to the generality of the findings to real industrial projects. Replicating the research in diverse situations enhances the validity of the conclusions and fosters further investigation in the relevant domains. To address this TV, we used three different subjects as case studies so that the results could be validated in diverse domains.

6 Conclusions and future work

This study presents an approach to resolving challenges in regression testing when supporting continuous integration activities in modern development strategies. Several techniques proposed in the literature have failed to increase the fault detection and identification rate because they did not exclude redundant faults or irrelevant test cases during execution. In fact, most existing techniques rely on code coverage or historical information when selecting and ranking test cases, thereby ignoring faulty test cases.

To address the limitations of existing methods, we have proposed the CTFF model, which ranks and selects test cases by first clustering the test cases that frequently change. Test cases are then prioritized by failure frequency, with coverage criteria applied in the case of a tie. Thus, CTFF improves regression testing specifically for agile software projects and has significant implications for software organizations. To evaluate the proposed techniques, we investigated three software applications, each across different versions.

The outcome of the evaluation shows the following:

  • The proposed technique significantly improved the fault detection rate (more than 90% of faults detected) at earlier stages compared with the other techniques (less than 50% for the RP and FB techniques).

  • The evaluation metrics illustrate that the proposed technique significantly outperforms (scores between 91 and 97%) the existing techniques (between 52 and 68%) in avoiding irrelevant test cases and redundant faults.

  • The results also show that the frequently changed and failed test case criteria are significant for fault detection in regression testing: they reduce irrelevancy in test case selection and redundancy in faults, and they reduce the need to maintain large amounts of historical information.

Furthermore, the empirical evaluation results show that the CTFF model has a high fault detection rate as well as the ability to identify the maximum number of faults rapidly. A limitation of our study is that the case studies were evaluated on previously collected datasets and the results were not statistically verified, as statistical verification is usually not adopted in a case study. To mitigate this limitation, an experimental evaluation with statistical analysis of the data is needed to establish reliability.

In future work, we plan to extend the CTFF model to resolve regression testing constraints in component-based software and product line engineering applications. We also plan to investigate additional research questions for controlled experiment-based evaluations; for instance, we may find correlations among different metrics to guide quality engineers and researchers in selecting the most suitable regression testing techniques in different scenarios and environments for optimal reliability analysis.