1 Introduction

Software bugs and failures cost the global economy trillions of dollars every year, according to a recent report by the software testing company Tricentis.Footnote 1 In 2017 alone, 606 software bugs cost the global economy about $1.7 trillion and affected 3.7 billion people. To alleviate this issue, researchers and practitioners have been striving to develop efficient testing techniques and tools to help improve the reliability of software systems before they are released to the public. Several strategies, such as random testing by Hamlet and Maciniak (1994), coverage-based testing by Zhu et al. (1997), and search-based testing by Harman et al. (2015), have been proposed to verify that a software product does what it is supposed to do. More recently, Deep Reinforcement Learning (DRL) has been increasingly leveraged for software testing purposes, as studied by Zheng et al. (2019), Bagherzadeh et al. (2021), Moghadam et al. (2021), and Malialis et al. (2015), thanks to the availability of multiple DRL frameworks providing implemented DRL algorithms, e.g., Advantage Actor Critic (A2C), Deep Q-Networks (DQN), and Proximal Policy Optimization (PPO). For example, Kim et al. (2018) leveraged the Keras-rl framework to apply DRL to test data generation. Similarly, Drozd et al. (2018) used the Tensorforce framework to apply DRL to fuzz testing, and Romdhana et al. (2022) used the Stable-baselines framework for black-box testing of Android applications.

However, given that these implemented DRL algorithms often make assumptions that may hold for certain types of problems but not for others, it can be challenging for developers and researchers to select the most adequate DRL implementation for their problem. The choice of a DRL algorithm depends on the nature of the problem to solve, the available computation budget, and the desired generalizability of the trained models. Moreover, since DRL algorithms are often implemented differently in different DRL frameworks, it is unclear whether the same results can be obtained using different frameworks.

To address these questions and help researchers and practitioners make informed decisions when choosing a DRL framework for their problem, in this paper we examine and compare the applicability of different DRL frameworks to software engineering testing tasks. Specifically, we apply DRL algorithms from different frameworks to game testing and test case prioritization. The automation of game testing is critical because of the frequent requirements changes that occur during a game development process, as studied by Santos et al. (2018). Recently, Yang et al. (2018, 2019), Koroglu et al. (2018), and Adamo et al. (2018) applied different DRL algorithms to automate game testing and improve the fault identification process. Test case prioritization improves the testing process by finding an optimal ordering of test cases so that faults are detected as early as possible. Bertolino et al. (2020) and Spieker et al. (2017) successfully applied DRL to prioritize test cases for various configurations. Moreover, as these tasks have gained a lot of attention recently, studying them allows us to provide meaningful results for the software engineering community.

In this paper, we perform a comprehensive comparison of different DRL algorithms implemented in three frameworks, i.e., Stable-baselines3 (Raffin et al. 2021), Keras-rl (Plappert 2016), and Tensorforce (Schaarschmidt et al. 2018). We investigate which DRL algorithms/frameworks may be more suitable for detecting bugs in games and solving the test case prioritization problem. Results show that the diversity of hyperparameters that each framework provides impacts its suitability for each of the studied software testing tasks. For some algorithms, the Tensorforce framework tends to be more suitable for detecting bugs, as it provides hyperparameters that allow a deeper exploration of the states of the environment, while the Stable-baselines framework tends to be more suitable for the test case prioritization problem.

To summarize, our work makes the following contributions:

  • To evaluate the usefulness of DRL for game testing, we utilized three state-of-the-art DRL frameworks: Stable-baselines, Keras-rl, and Tensorforce. Specifically, we applied them to the Block Maze game for bug detection and collected the number of bugs, the state coverage, the code coverage, the cumulative reward, and the average training and prediction times. We compared a total of seven DRL configurations, some of which outperform the existing work.

  • Based on eight publicly available datasets, we applied state-of-the-art DRL frameworks to two ranking models and collected results to evaluate their usefulness in prioritizing test cases. As comparison metrics, we consider the Normalized Rank Percentile Average (NRPA), the Average Percentage of Faults Detected (APFD), and the average training and prediction times for each DRL configuration. The collected results are compared with the baselines, and we derive conclusions regarding the most accurate DRL frameworks for test case prioritization. We found that on most datasets, the Stable-baselines framework originally used by Bagherzadeh et al. (2021) performs better than Tensorforce and Keras-rl.

  • We provide some recommendations for researchers looking to select a DRL framework, as we noticed differences in performance when considering the same algorithm across different frameworks. For example, the same DQN algorithm from different frameworks shows different results.

The rest of this paper is organized as follows. In Sect. 2, we review the necessary background knowledge on the game testing problem, the test case prioritization problem, and DRL. The methodology followed in our study is described in Sect. 3. We discuss the obtained empirical results in Sect. 4. Some recommendations are provided in Sect. 5. We review related work in Sect. 6. Threats to the validity of our study are discussed in Sect. 7. Finally, we conclude the paper and discuss future work in Sect. 8.

2 Background

In this section, we first introduce DRL and present some state-of-the-art DRL frameworks. Then, we describe the terms and notations used to define the test case prioritization and game testing problems.

2.1 Deep Reinforcement Learning

A DRL agent interacts with an environment that can be modelled as a Markov decision process \((\mathcal {S}, \mathcal {A}, \mathcal {P}, \gamma )\) with the following components:

State of the environment: A state \( s \in \mathcal {S} = \mathbb {R}^n \) represents the agent's perception of the environment.

Action: Based on the observation (i.e., state of the environment), the agent chooses among available actions in \(\mathcal {A}\).

State transition distribution: \(\mathcal {P}(s_{t+1},r_{t}|s_{t},a_{t})\), with \(a_{t} \in \mathcal {A}\), defines the probability that the agent moves to the next state \(s_{t+1}\) and receives the reward \(r_{t}\) when performing action \(a_{t}\) in state \(s_{t}\). The goal of the agent is to maximize the expected rewards discounted by \(\gamma \). To decide which action to take given its observation, the DRL agent follows a policy \(\pi : \mathcal {S} \rightarrow \mathcal {A}\), which is a mapping from \(\mathcal {S}\) to \(\mathcal {A}\).

Episode: An episode is a sequence of states of the environment, actions performed by the agent, and rewards (an incentive mechanism that tells the agent about the effectiveness of its actions); it ends when the agent has reached a terminal state or a maximum number of steps.

Policy: Given an agent, a policy \(\pi \) is defined as a function \(\pi : \mathcal {S} \rightarrow \mathcal {A}\) mapping each state \(s \in \mathcal {S}\) to an action \(a \in \mathcal {A}\). The policy indicates the agent's decision in each state of the underlying task. It can be a strategy provided by a human expert or learned from experience.
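
To make these notions concrete, the following minimal sketch (assuming the classic OpenAI gym API, with CartPole-v1 used purely as a placeholder environment) runs one episode of agent-environment interaction, with a random policy standing in for a learned one.

```python
import gym

env = gym.make("CartPole-v1")       # placeholder environment; any gym environment works
state = env.reset()                 # initial state of the environment
done, episode_reward = False, 0.0
while not done:                     # an episode ends at a terminal state or a step limit
    action = env.action_space.sample()             # a policy maps the state to an action
    state, reward, done, info = env.step(action)   # transition governed by P(s', r | s, a)
    episode_reward += reward        # the agent aims to maximize the discounted return
env.close()
```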

DRL algorithms can be classified based on the following properties similar to the work by Bagherzadeh et al. (2021):

Model-based and Model-free DRL. In model-based DRL, the agent has a model of the environment: it knows in advance how the environment reacts to possible actions and the potential rewards it will get from taking each action. During training, the agent learns the optimal behavior by taking actions and observing the outcomes, which include the next state and the immediate reward. On the contrary, in model-free DRL, the agent has to learn the dynamics of the environment by interacting with it; from this interaction, it learns an optimal policy for selecting actions. In this work, we are only interested in model-free DRL algorithms, because some test case features (e.g., the execution time), as well as the location of faults in a game, are unknown beforehand.

Value-based, policy-based, and actor-critic learning. At every state, value-based methods estimate the Q-value of each available action and select the action with the best Q-value, where a Q-value indicates how good an action is expected to be in a given state. In policy-based methods, an initial policy is parameterized and, during training, its parameters are updated using gradient-based or gradient-free optimization techniques. In actor-critic methods, the agent learns simultaneously from value-based and policy-based techniques: the policy function (actor) selects the action and the value function (critic) estimates the Q-values based on the action selected by the actor.
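
As a small illustration of the value-based case, the following sketch (a generic example, not tied to any of the studied frameworks) selects the action with the best estimated Q-value while keeping a small probability epsilon of exploring a random action.

```python
import numpy as np

def epsilon_greedy(q_values, epsilon=0.1):
    """Value-based action selection: exploit the best Q-value, explore with probability epsilon."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(q_values))  # explore: pick a random action
    return int(np.argmax(q_values))              # exploit: pick the action with the best Q-value

action = epsilon_greedy(np.array([0.2, 0.8, 0.1]))  # returns action 1 most of the time
```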

Action and observation space. The action space defines the possible moves of the agent inside the environment, while the observation space defines what the agent can know about the environment. Both can be discrete or continuous; in particular, an observation can be a single real number or a high-dimensional vector. A discrete action space means that the agent chooses its action among a finite set of distinct values, whereas a continuous action space implies that the agent chooses actions among vectors of real values. Not all DRL algorithms support both discrete and continuous configurations of the action and observation spaces, which limits the choice of algorithms to implement.

On-policy vs Off-policy. On-policy methods evaluate and improve the same policy that is used to collect data and take actions. On the contrary, off-policy methods evaluate and improve a target policy that is different from the behavior policy used to generate the data. Off-policy learners generally use a replay buffer to update the policy.
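
The following minimal replay buffer sketch illustrates how off-policy learners reuse transitions that may have been generated by earlier behavior policies; it is a generic example, not the buffer implementation of any particular framework.

```python
import random
from collections import deque

class ReplayBuffer:
    """Stores past transitions so that an off-policy learner can sample them for updates."""

    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Sampled transitions may come from older policies, which is what makes learning off-policy.
        return random.sample(self.buffer, batch_size)
```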

DRL methods use Deep Neural Networks (DNNs) to approximate the value function, the policy, or the model (state transition function and reward function), which makes the solution space more manageable in large, complex environments.

2.2 State-of-the-Art DRL Frameworks

In recent years, Lillicrap et al. (2015) and Mnih et al. (2016) introduced multiple model-free DRL algorithms, advancing the research around DRL. Different DRL frameworks, such as Stable-baselines (Raffin et al. 2021; Hill et al. 2018) and Tensorforce by Schaarschmidt et al. (2018), have also been introduced to ease the implementation of DRL-based applications. These frameworks usually contain implementations of different DRL algorithms. While developers may implement their own algorithms, in this work we focus on comparing the implemented algorithms of existing DRL frameworks on software testing tasks. Table 1 provides a list of popular DRL frameworks, which are described below.

  • OpenAI baselines (Dhariwal et al. 2017) is the most popular DRL framework given its high GitHub star rating. It provides many state-of-the-art DRL algorithms. After installing the package, training a model only requires specifying the name of the algorithm as a parameter.

  • Stable-baselines (Raffin et al. 2021; Hill et al. 2018) is an improved version of OpenAI baselines with more comprehensive documentation. In this paper, we used version 3 of this framework, which is reported to be more reliable because of its Pytorch (Paszke et al. 2019) backend for the DNN policies. To train an agent, Stable-baselines has built-in functions that create a model depending on the chosen DRL algorithm.

  • Keras-rl (Plappert 2016) provides the dueling extension of the DQN algorithm and the SARSA algorithm, which are not offered by Stable-baselines version 3. However, Keras-rl offers fewer algorithms than the previous frameworks. Training an agent requires a few steps: defining the DNN that will be used for training, instantiating the agent, compiling it, and finally calling the training function.

  • Tensorforce (Schaarschmidt et al. 2018) provides the same algorithms as the Stable-baselines framework with some additions: Trust-Region Policy Optimization (TRPO), Dueling DQN, Reinforce, and the Tensorforce Agent (TA). Tensorforce offers built-in functions to create and train an agent. It also offers the flexibility to train the agent without the built-in training function, which allows capturing performance metrics of the agent, such as the reward; in that case, the training runs in a loop over the desired number of episodes. Tensorforce relies on TensorFlow (Abadi et al. 2015) as its backend.

  • Dopamine (Castro et al. 2018) is a more recent framework that proposes an improved variant of the Deep Q-Networks (DQN) algorithm and the Soft Actor-Critic (SAC) algorithm. In addition to a TensorFlow backend for creating DNNs, Dopamine uses the ginFootnote 2 framework to specify and configure hyperparameters. Training an agent requires instantiating the model and then starting the training with built-in functions.

Table 1 Popular DRL frameworks

Based on their popularity and ease of implementation, we chose to rely on the Stable-baselines, Tensorforce, and Keras-rl frameworks. Table 2 summarizes the implemented DRL algorithms available in these frameworks.

Table 2 Comparison between DRL frameworks

Stable-baselines, Keras-rl, and Tensorforce have respectively 6, 5, and 10 implemented DRL algorithms available. They all contain the DQN algorithm, which we apply to the test case prioritization and game testing problems. We also apply the A2C algorithm from Stable-baselines and Tensorforce to both problems. In addition to A2C and DQN, we apply the DDPG algorithm to the test case prioritization problem and the PPO algorithm to both problems. In its second version, the Stable-baselines framework offers two versions of the PPO algorithm: PPO1, which requires OPENMPIFootnote 3 for multiprocessing, and PPO2, which uses vectorized environments for multiprocessing. In this work, we chose to leverage PPO2 for two reasons. First, the version of OPENMPI required by PPO1 is not compatible with our experimental environment. Second, PPO from the Tensorforce framework uses vectorized environments for parallelism, so it is fair to compare it with PPO2 from Stable-baselines. For readability, we refer to PPO2 as the PPO from Stable-baselines. Keras-rl does not provide implementations of the A2C or PPO algorithms that could be applied to the previously mentioned problems. Moreover, the selected DRL algorithms are suitable for this paper, as we can compare their results with the baselines by Zheng et al. (2019) and Bagherzadeh et al. (2021). Zheng et al. (2019) used their own implementation of the A2C algorithm to detect bugs in three games. Thus, among the selected DRL strategies, we consider the A2C algorithm from the DRL frameworks and compare our results with the results reported by Zheng et al. Given that the applicability of DRL algorithms is limited by the type of their action space, Bagherzadeh et al. (2021) chose DRL algorithms from Stable-baselines that are compatible with the type of action space of the prioritization techniques they considered (see Sect. 3.3.1). We do the same, and evaluate and compare the obtained results.
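
To illustrate how the training workflow differs across the three selected frameworks, the sketch below trains a DQN agent on the same gym environment with each of them. The environment name, step budget, and hyperparameter values are placeholders and do not correspond to our experimental settings.

```python
import gym

env = gym.make("CartPole-v1")  # placeholder environment
nb_actions = env.action_space.n

# Stable-baselines3: a built-in constructor and a single call to learn().
from stable_baselines3 import DQN
DQN("MlpPolicy", env, verbose=0).learn(total_timesteps=10_000)

# Keras-rl: define the DNN, instantiate the agent, compile it, then call fit().
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers import Adam
from rl.agents.dqn import DQNAgent
from rl.policy import EpsGreedyQPolicy
from rl.memory import SequentialMemory

model = Sequential([Flatten(input_shape=(1,) + env.observation_space.shape),
                    Dense(64, activation="relu"),
                    Dense(nb_actions, activation="linear")])
dqn = DQNAgent(model=model, nb_actions=nb_actions, policy=EpsGreedyQPolicy(),
               memory=SequentialMemory(limit=10_000, window_length=1))
dqn.compile(Adam(learning_rate=1e-3))
dqn.fit(env, nb_steps=10_000, verbose=0)

# Tensorforce: built-in creation, with an explicit loop that exposes the per-episode reward.
from tensorforce import Agent, Environment

tf_env = Environment.create(environment="gym", level="CartPole-v1")
agent = Agent.create(agent="dqn", environment=tf_env, memory=10_000, batch_size=32)
for episode in range(100):
    states, terminal, episode_reward = tf_env.reset(), False, 0.0
    while not terminal:
        actions = agent.act(states=states)
        states, terminal, reward = tf_env.execute(actions=actions)
        agent.observe(terminal=terminal, reward=reward)
        episode_reward += reward
```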

2.3 Game Testing

The process of testing a game is an essential activity before its official release. The complexity of game testing has led researchers to investigate ways to automate it (Alshahwan et al. 2018; Fraser and Arcuri 2011). In the following, we introduce a few concepts that are important to understand automatic game testing.

Definition 1

: Game. A game G can be defined as a function \(G: A^n \rightarrow (S \times R)^n\), where A is the set of actions that can be performed by the agent, S is the set of states of the game, R represents the set of rewards that come from the game, and n is the number of steps in the game. A player takes a sequence of n actions based on the observations it receives until the end of the game. If we consider the game as an environment that the agent interacts with, each state refers to the observations of the environment perceived by the agent at every time step. An action is a decision made by the agent, which can be rewarded positively or negatively by the environment.

Fig. 1 The interaction between a player and a game environment (inspired by Zheng et al. (2019))

Figure 1 depicts the overall interaction between a player and a game. Given the state \(s_t\) at time step t, the agent selects an action \(a_t\) to interact with the game environment and receives a reward \(r_t\) from the environment. The environment moves into a new state \(s_{t+1}\), which affects the selection of the next action.

Definition 2

: Game state. A state in the game refers to the game's current status and can be represented as a fixed-length vector \((v_0, v_1,..., v_n)\). Each element \(v_i\) of the vector represents an aspect of the state of the game, such as the position of a player, its speed, or the location of the gold trophy in the case of a Block Maze game.

Definition 3

: Game tester. Given a game G, a set of policies \(\Pi \) to interact with G, a set of states S of G, and a set of bugs B on G, a game tester T is defined as a function \(T_G: \Pi \rightarrow S \times B\).

A sequence of actions is a test case for a game. Since G is often a stochastic function, a test case may lead to multiple distinct states. In this paper, the game tester plays the role of an oracle that verifies the presence or absence of a bug in an output state. To do so, it implements different strategies to explore the different states of the game and find bugs. Thus, a test case generated by a game tester is a series of valid actions that can reach a state in which a bug might hide.

Definition 4

: Test adequacy criteria. We consider state coverage and line coverage as criteria to assess whether the existing test cases have a good bug-revealing ability. The state coverage measures the number of states visited by the player during the play, and the code coverage measures the number of lines of code of the game's functions that have been covered during the play.

Consider a \(5 \times 5\) Block Maze game where bugs are injected and triggered when the player reaches a given location on the maze:

  • A player has 4 possible actions (LEFT, RIGHT, UP, DOWN). A state is defined as the vector (P, B), where P is the player's position at each step of the play and B is the position of a bug (the position that triggers a bug on the maze).

  • Initially, the state of the Block Maze is ((0, 0), (1, 4)), i.e., the player is at position (0, 0) and the bug at position (1, 4).

  • A test case that leads to a bug can be

    • {RIGHT \(\rightarrow \) RIGHT \(\rightarrow \) RIGHT \(\rightarrow \) RIGHT \(\rightarrow \) DOWN},

    • corresponding to the following states of the game

    • \(\{ ((0,0),{\textbf {(1,4)}}) \rightarrow ((0,1),{\textbf {(1,4)}}) \rightarrow ((0,2),{\textbf {(1,4)}}) \rightarrow \)

    • \(((0,3),{\textbf {(1,4)}}) \rightarrow ((0,4),{\textbf {(1,4)}}) \rightarrow ({\textbf {(1,4),(1,4)}})\}\).

As studied by Zheng et al. (2019), in this work, we consider the testing of large combat games with one agent.

2.4 Test Case Prioritization

Test Case Prioritization is the process of prioritizing test cases in a test suite. It allows executing highly significant test cases first, according to some measures, in order to detect faults as early as possible. In this paper, similar to Bagherzadeh et al. (2021), we study test case prioritization in the context of Continuous Integration (CI).

Definition 5

: CI Cycles. A CI cycle is composed of a logical value and a set of test cases. The logical value indicates whether or not the cycle has failed. In this work, we consider cycles that failed due to a test case failure, and we select test cases with at least one failed cycle.

Definition 6

: Test case feature. Each test case has execution history and code-based features. The execution history is a record of the executions of a test case over the cycles. It includes the execution verdict of the test case, its execution time, a sequence of verdicts from prior cycles, and the test age, capturing when the test case was introduced for the first time. The execution verdict indicates whether the test case has failed or not. The execution time of a test case can be computed by averaging its previous execution times. The code-based features of a test case indicate the changes that have been made and the impacted files with their number of lines of code; they are relevant to predict the execution time and can be leveraged to prioritize test cases.

Definition 7

: Optimal ranking (Test Case prioritization). The test case prioritization process in this work is a ranking function that produces an ordered sequence of test cases, and the goal of prioritization is to get as close as possible to the optimal ranking. The optimal ranking of a set of test cases is an order in which all test cases that fail are executed before test cases that pass. Furthermore, in this optimal ranking, test cases with a smaller execution time are executed sooner.

Definition 8

: DRL as a ranking process. In this paper, we consider a prioritization approach that consists of continuously interacting with the CI environment while improving the ranking strategy. In the CI environment, a DRL agent is used to automatically and continuously learn a ranking strategy that is as close as possible to the optimal one. Specifically, the agent is trained on the CI environment by replaying the execution logs of available test cases from previous cycles in order to rank test cases in subsequent cycles. The main idea, similar to other studies (Bagherzadeh et al. 2021), is to formulate the sequential interactions between CI and the test case prioritization algorithm as a DRL problem. This way, state-of-the-art DRL techniques learn a strategy for test case prioritization that is as close as possible to the optimal one, considering a predetermined optimal ranking as the ground truth. Using a CI environment simulator, the DRL agent is trained on the history of test execution and code-based features from previous cycles to prioritize test cases in the next cycles. DRL enables an adaptive training process, meaning that the agent receives feedback (i.e., a reward) at the end of each cycle (or when the prediction accuracy drops below a particular level). To adapt the learned policy, the execution logs of test cases can be replayed several times to ensure an efficient and continuous adaptation to changes in the system and regression test suite.

Bagherzadeh et al. (2021) also presented a detailed explanation of the terms CI Cycles, Test case feature, Test Case prioritization, and Optimal ranking.

3 Study Design

In this section, we describe the methodology of our study which aims to compare different implemented DRL algorithms from existing frameworks. We also introduce the two problems that we selected for this comparison.

3.1 Research Questions

The goal of our work is to evaluate and compare implemented algorithms offered by different DRL frameworks. In order to achieve this goal, we focus on answering the following research questions.

  • RQ1: How does the choice of DRL framework affect the performance of the software testing tasks?

  • RQ2: Which combinations of DRL frameworks-algorithms perform better (i.e., get trained accurately and solve the problem effectively)?

  • RQ3: How stable are the results obtained from the DRL frameworks, over multiple runs?

3.2 Problem 1: Game Testing Using DRL

We aim to employ several DRL algorithms from different DRL frameworks in a game testing environment. More specifically, we use DRL to explore more states of a game where bugs might hide. Our work is based on wuji (Zheng et al. 2019), an automated game testing framework that combines Evolutionary Multi-Objective Optimization (EMOO) and DRL to detect bugs in a game. wuji randomly initializes a population of policies (represented by DNNs), adopts EMOO to diversify the exploration of states, and then uses DRL to improve the capability of the agent to accomplish its mission. To train wuji with multiple DRL frameworks, we turn off EMOO and only consider the DRL part of wuji. In this way, we can focus on the effect of different DRL algorithms on detecting bugs.

3.2.1 Creation of the DRL Environment

A game environment can be mapped to a DRL process by defining the state (or observation), reward, action, end of an episode, and the information related to bugs.

Observation space:

As mentioned in Definition 2, an observation is a set of features describing the state of the game. In our case, the observation of the agent is its position inside the maze.

Action space:

The action space describes the available moves that can be made in the game. We consider a game with 4 discrete actions: north, south, east, west.

Reward function:

The reward function is the feedback from the environment regarding the agent's actions. It is designed so that the agent can accomplish its mission. The agent is negatively rewarded when it reaches an invalid position in the game or any other position that is not the goal position of the game; in all other cases, it receives a positive reward.

The game testing task is representative of an SE testing task, as its representation is similar to the baseline study by Zheng et al. (2019) on detecting bugs in a Block Maze game. In the game testing problem, the observation of the agent captures the state of the game, where a bug might hide. The observation space has the size of a \(20 \times 20\) matrix, similar to the baseline study by Zheng et al. (2019), and it is straightforward to look for bugs in a matrix. The action space describes the moves (north, south, east, west) available to the agent to explore the game and find bugs. Finally, the reward function rewards the agent based on its actions so that it can accomplish the game. Matrix observation spaces have also been used in the literature. To promote the progress of DRL research, OpenAI integrated a collection of DRL tasks into the gym platform (Brockman et al. 2016a); among these tasks, the Atari environments have matrix observation spaces. Our representation can easily be extended to other games, such as 3D games, by extending the number of actions available to the agent or by adding channels to the matrix, forming a 3D image. Further, Tufano et al. (2022) study how to leverage DRL algorithms to detect performance bugs. Specifically, the authors artificially injected performance bugs into two 2D games, Cartpole (2016) and MsPacman (2018), and investigated whether the DRL agents are able to detect them. Similar to our study, the moves available to the agents in the MsPacman game are left, right, up, and down, and its observation space has the size of an \(84 \times 84\) matrix. Bergdahl et al. (2020) employed DRL to increase test coverage, find game exploits, and discover bugs in a game. The authors studied sand-box environments where DRL agents receive a positive reward for moving towards a goal and a negative reward as a penalty for moving away from it, which is similar to our study.

3.2.2 Experimental Setup

The Block Maze game has a discrete action space, which limits the DRL configurations that can be applied to it. Therefore, we consider the following algorithms in our experiments: DQN-SB, PPO-SB, A2C-SB, DQN-KR, DQN-TF, PPO-TF, and A2C-TF.

DRL algorithms from the studied DRL frameworks have their own hyperparameter settings. We employ the same values as the baseline work (Zheng et al. 2019) for the optimizer (the Adam optimizer, Kingma and Ba 2014), the DNN model (three fully-connected linear layers with 256, 128, and 128 units as the hidden layers, connected to the output layer), the discount factor (0.99), and the learning rate (\(0.25 \times 10^{-3}\)), as these are the hyperparameters we could exactly match across the different studied DRL algorithms. DQN-SB, DQN-TF, DQN-KR, PPO-TF, PPO-SB, A2C-SB, and A2C-TF have respectively 19, 21, 7, 25, 19, 18, and 23 additional hyperparameters, whose values are provided in the replication package (Replication package 2022).
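
For illustration, the following is a minimal sketch of how these shared values could be passed to the Stable-baselines3 DQN implementation (which uses the Adam optimizer by default); the environment is a placeholder, and all remaining hyperparameters are left at their defaults rather than the values documented in the replication package.

```python
import gym
from stable_baselines3 import DQN

env = gym.make("CartPole-v1")  # placeholder; in our experiments this is the Block Maze environment

model = DQN(
    "MlpPolicy",
    env,
    learning_rate=0.25e-3,                          # shared learning rate
    gamma=0.99,                                     # shared discount factor
    policy_kwargs=dict(net_arch=[256, 128, 128]),   # three hidden layers, as in the baseline
)
model.learn(total_timesteps=10_000)
```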

We collected the results of each DRL algorithm for two step budgets: 4 million steps and 10,000 steps. To counter the effect of randomness during testing, we repeat each run 10 times and average the results. The experiments ran for approximately 30 days on the Niagara cluster servers provided by the Digital Research Alliance of Canada (the Alliance).Footnote 4 Each server has 40 cores at 2.4 GHz with 202 GB of main memory. Moreover, the testing experiments for 4 million steps were run on an ASUS desktop machine running Windows 10 with a 3.6 GHz Intel Core i7 CPU and 16 GB of main memory. After each episode, the agent is reset before the next one. Zheng et al. (2019) studied the detection of bugs by implementing a DRL approach, testing the game while considering the winning score. We consider their work as a baseline and compare the other DRL approaches with their results.

3.2.3 Training of a DRL Agent

Wuji randomly initializes DNN policies, then uses the A2C algorithm and an evolutionary multi-objective optimization algorithm to evolve the policies, so that the agent can explore more states of the Block Maze game and accomplish its mission. In this paper, we apply the DQN, A2C, and PPO algorithms from Stable-baselines3 (SB), Keras-rl (KR), and Tensorforce (TF) to detect bugs in the Block Maze game. Stable-baselines3 is used here, as opposed to Stable-baselines2, because the latter is in maintenance mode by its developers. Like Zheng et al. (2019), we train the agent by having it interact with the game (the environment). The DRL agents use the gym interface during training to compute the best policy to play the game. During testing, the same OpenAI gym interface to the game environment is used.

Regarding the reward distribution of the DRL agent: if it reaches the goal, it receives a reward of 10; if its position is invalid, i.e., not within the environment space, it receives a reward of \(-1\); otherwise, it receives \(-0.01\).
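
The following is a minimal sketch of how this reward scheme could be expressed in a gym environment; the class name, grid size, goal position, and termination handling are illustrative assumptions and do not reproduce the actual environment from the replication package.

```python
import gym
import numpy as np
from gym import spaces

class BlockMazeEnv(gym.Env):
    """Illustrative Block Maze environment implementing the reward scheme described above."""

    MOVES = {0: (-1, 0), 1: (1, 0), 2: (0, 1), 3: (0, -1)}  # north, south, east, west

    def __init__(self, size=20, goal=(19, 19)):              # size and goal are assumptions
        self.size, self.goal = size, goal
        self.action_space = spaces.Discrete(4)
        self.observation_space = spaces.Box(0, size - 1, shape=(2,), dtype=np.int64)
        self.position = (0, 0)

    def reset(self):
        self.position = (0, 0)
        return np.array(self.position)

    def step(self, action):
        dr, dc = self.MOVES[action]
        row, col = self.position[0] + dr, self.position[1] + dc
        if not (0 <= row < self.size and 0 <= col < self.size):
            # Invalid position: reward of -1; invalid locations end the game.
            return np.array(self.position), -1.0, True, {}
        self.position = (row, col)
        if self.position == self.goal:
            return np.array(self.position), 10.0, True, {}    # reaching the goal: reward of 10
        return np.array(self.position), -0.01, False, {}      # any other move: reward of -0.01
```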

3.2.4 Datasets

A Block Maze game from Zheng et al. (2019), shown in Fig. 2, is selected for the evaluation.

Fig. 2 Block Maze with bugs (red, green and yellow dots)

In the Block Maze game, the player's objective is to reach the goal coin. The player has 4 possible actions to choose from: north, south, east, west. Every action moves the player into the neighboring cell in the corresponding direction, except that a collision with a block (dark green) results in no movement. To evaluate the effectiveness of our DRL approaches, 25 bugs are artificially injected into the Block Maze and randomly distributed within the environment. A bug is a position in the Block Maze that is triggered if the robot (agent) reaches its location on the map, as shown in Fig. 2. A bug has no direct impact on the game but can be located in invalid locations of the game environment, such as the Block Maze obstacles or outside of the Block Maze observation space. Invalid locations, on the other hand, cause the end of the game. Therefore, in this study we consider 2 types of bugs: Type 1 refers to exploratory bugs that measure the exploration capabilities of the agent, and Type 2 refers to bugs at invalid locations of the Block Maze.

3.2.5 Evaluation Metrics

In addition to the metrics considered by Zheng et al. (2019), i.e., the number of bugs detected and the state and line coverage achieved by the DRL configurations, we also measure the average cumulative reward, the training time, and the testing time to assess the accuracy and effectiveness of the game testing process across the different DRL approaches.

  • Number of bugs detected: the average number of bugs detected by our DRL agents after being trained.

  • The average cumulative reward: obtained by the DRL agents after being trained.

  • The line coverage: the lines covered by each DRL approach during testing. We use the Python coverageFootnote 5 library to collect line coverage. This library reports results per Python file. As in our replication package (Replication package 2022), both the gym environment and the actual game implementation are in the same file; thus, the line coverage includes the lines of code of both the gym environment and the game implementation.

  • The state coverage: the number of states visited during testing.

  • Training time: We collect the time consumed by the DRL agents to train their policy, which lasts for 10,000 steps.

  • Prediction time: We collect the time consumed by the trained DRL agents to detect bugs for 10,000 steps, 4 million steps, or until reaching the goal coin of the game environment.

3.2.6 Analysis Method

We proceeded as follows to answer our research questions. In RQ1, we collected the number of bugs detected, the average cumulative reward, and the state and line coverage obtained by the player in the Block Maze game using DRL algorithms from state-of-the-art frameworks (see Subsection 3.2.2), relying on the implementations provided by the Stable-baselines3 (Raffin et al. 2021), Keras-rl (Plappert 2016), and Tensorforce (Schaarschmidt et al. 2018) frameworks. We also collected the training and testing times of these DRL configurations, and computed the state coverage and line coverage as adequacy criteria to assess their performance. To determine the best DRL strategy in RQ2, we use Welch's ANOVA and the Games-Howell post-hoc test (Welch 1947; Games and Howell 1976). We compare all DRL strategies across all runs in terms of bugs detected and average cumulative reward earned. As in the study of Bagherzadeh et al. (2021), the significance level is set to 0.05; a difference with p-value \(<= 0.05\) is considered significant. In RQ3, we investigate how the same algorithm performs, on average, across different DRL frameworks over multiple runs of testing. Specifically, the performance of agents trained with the same algorithm across different DRL frameworks over multiple runs is evaluated based on metrics such as the number of bugs detected, the average cumulative reward, and the training and prediction times collected in RQ1.

Welch's ANOVA is a statistical test used to compare differences between groups by analyzing their means and their variances. The Games-Howell post-hoc test complements Welch's ANOVA by identifying the groups that significantly differ from the others with respect to the mean. The Games-Howell post-hoc test is used with Welch's ANOVA because the latter does not assume equal variances between groups (Games and Howell 1976).
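
For example, assuming the per-run results are gathered in a long-format table with one row per run, both tests can be computed with the pingouin library, as in the following sketch (an illustration with hypothetical data, not necessarily the tooling used in our experiments).

```python
import pandas as pd
import pingouin as pg

# Hypothetical per-run results: one row per run, with the strategy and the measure of interest.
runs = pd.DataFrame({
    "strategy": ["DQN_SB"] * 10 + ["DQN_KR"] * 10 + ["DQN_TF"] * 10,
    "bugs":     [7, 8, 6, 7, 9, 8, 7, 6, 8, 7,
                 5, 6, 5, 4, 6, 5, 5, 6, 4, 5,
                 3, 4, 3, 2, 4, 3, 3, 2, 4, 3],
})

# Welch's ANOVA: do the group means differ, without assuming equal variances?
print(pg.welch_anova(dv="bugs", between="strategy", data=runs))

# Games-Howell post-hoc test: which pairs of strategies differ significantly (p < 0.05)?
print(pg.pairwise_gameshowell(dv="bugs", between="strategy", data=runs))
```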

3.3 Problem 2: Test Case Prioritization Using DRL

We aim to apply several DRL algorithms from different frameworks to test case prioritization in the context of CI. To do so, we follow a recent work on using DRL for test case prioritization by Bagherzadeh et al. (2021). The authors studied different prioritization techniques that can adapt and continuously improve while interacting with the CI environment. The interaction between the CI environment and test case prioritization is modeled as a DRL problem, and state-of-the-art DRL techniques are used to learn prioritization strategies that are close to the optimal ranking. The DRL agent is first trained offline on the test case execution history and code-based features of past cycles to prioritize test cases in the next cycles. At the end of each cycle, if the agent's accuracy in predicting the next cycles is below a specified threshold, the test case execution history is replayed to improve the agent's policy. After offline training, the trained agent can be applied to rank the available test cases. Similarly, our approach for applying DRL techniques in the context of the CI environment is to train a DRL agent based on the algorithms designed by Bagherzadeh et al. (2021), which describe the ranking models in the context of CI and test case prioritization. We train DRL agents using various DRL algorithms from popular frameworks, as described in Subsection 2.2.

3.3.1 Creation of the DRL Environment

Test case prioritization can be mapped to a DRL problem by defining the details of the agent's interaction with the environment, namely the observation, action, reward, and end condition of an episode. We map test case prioritization as a DRL problem by considering two ranking models, pointwise and pairwise, that have been employed by Bagherzadeh et al. (2021).

Pointwise ranking function

Bagherzadeh et al. (2021) designed the pointwise ranking model as a class on which the observation space, action space, and reward function are defined. This model assigns a score to each test case and stores the scores in a temporary vector. At the end of the learning process, the test cases are sorted according to their scores stored in the temporary vector.

Observation space: The agent’s observation is a record of the characteristics of a single test case with 4 numerical values.

Action space: The action describes a score associated with each test case. The agent uses this score to order the test cases. Each action is a real number between 0 and 1.

Reward function: The reward function is computed here based on the normalized distance between the assigned ranking and the optimal ranking. The values range between 0 and 1.

Pairwise ranking function

Bagherzadeh et al. (2021) designed the pairwise ranking model as a class on which the observation space, action space, and reward function are defined. This class uses the selection sort algorithm (Knuth 1997) to rank the test cases. The test cases are divided into two parts: the sorted part on the left and the unsorted part on the right. At each time step, if a test case with a higher priority is found, it is moved to the sorted part. The process continues until all test cases are sorted.

Observation space: An agent observation is a pair of test case records.

Action space: The possible actions are 0 and 1, where the first value (0) indicates that the first test case in the observed pair has the higher priority.

Reward function: The reward function takes into account whether or not the test case given the higher priority fails. If it does, the agent receives the maximum reward of 1; otherwise, it receives 0. In case the test cases in the pair have the same verdict, the agent receives a reward of 0.5 when the higher priority is given to the test case with the smaller execution time; otherwise, it receives 0.
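
A minimal sketch of this pairwise reward is shown below, assuming each test case record exposes a failure verdict and an (average) execution time; the field names and the action convention are illustrative and do not necessarily match the implementation of Bagherzadeh et al. (2021).

```python
def pairwise_reward(first, second, action):
    """Reward for deciding which of two test cases gets the higher priority.

    `first` and `second` are dicts with illustrative fields `failed` (0 or 1) and
    `exec_time`; `action` is 0 if the first test case is given the higher priority.
    """
    high, low = (first, second) if action == 0 else (second, first)
    if high["failed"] != low["failed"]:
        # Maximum reward only when the failing test case is ranked first.
        return 1.0 if high["failed"] else 0.0
    # Same verdicts: reward 0.5 if the test case with less execution time is ranked first.
    return 0.5 if high["exec_time"] <= low["exec_time"] else 0.0
```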

The test case prioritization task is representative of an SE testing task, as its representation is similar to the baseline study by Bagherzadeh et al. (2021) on ranking test cases. The observation spaces of the ranking strategies capture the characteristics of the test cases, which are used to rank them; based on the score or priority of a test case, a subsequent test case is selected. The reward function evaluates the capacity of the agent to rank test cases w.r.t. the optimal ranking. Spieker et al. (2017) applied DRL to the prioritization of test cases for various configurations. Similar to our study, the observation of the environment captures the characteristics of a test case, and the action space represents the priority of a test case for the current CI cycle.

3.3.2 Experimental Setup

We implemented our ranking models using the DRL algorithms of the selected frameworks. We used the OpenAI gym library (Brockman et al. 2016b) to mimic the CI environment using execution logs, and relied on the implementations of the DRL algorithms provided by the Stable-baselines2 (Hill et al. 2019), Keras-rl (Plappert 2016), and Tensorforce (Schaarschmidt et al. 2018) frameworks. Stable-baselines2 is used here as it was originally used by Bagherzadeh et al. (2021). In any case, Stable-baselines2 and Stable-baselines3 provide the same hyperparameters for their implemented DRL algorithms. Moreover, to make sure Stable-baselines3 matches the performance of Stable-baselines2, its developers conducted experimentsFootnote 6 to assess the performance of its implemented DRL algorithms and found them equivalent; thus, no performance drop should be expected from using either one. When applicable, we employ the default hyperparameter values of Stable-baselines2, similarly to the original work (Bagherzadeh et al. 2021). Specifically, the architecture of the DNN model, the learning rate, and the discount factor have the same values across all experiments. The details of all hyperparameter settings are documented in the replication package (Replication package 2022). Regarding the APFD and NRPA metrics, for each dataset we performed several experiments corresponding to the pairwise and pointwise ranking models. It should be noted that the applicability of the DRL algorithms is restricted by the type of their action space. The pairwise ranking model involves seven experiments for each dataset, one for each DRL framework-algorithm combination that supports a discrete action space (i.e., DQN-SB, DQN-KR, DQN-TF, A2C-SB, A2C-TF, PPO-TF, PPO-SB). Similarly, the pointwise ranking model involves seven experiments for each dataset, one for each combination that supports a continuous action space (i.e., DDPG-SB, DDPG-KR, DDPG-TF, A2C-SB, A2C-TF, PPO-TF, PPO-SB). The training process begins with training an agent by replaying the execution logs of the first cycle, followed by evaluating the trained agent on the second cycle. Then the logs of the second cycle are replayed to improve the agent, and so on.

Bagherzadeh et al. (2021) trained the agent for a minimum of \(200 \times n \times \log _2 n\) episodes and one million steps for training each cycle, where n refers to the number of test cases in the cycle. Training stops when the budget of steps per training instance is exhausted or when the sum of rewards in an episode cannot be improved for more than 100 consecutive episodes. After each episode, the agent is reset before the next one. To answer our research questions, we recorded the rank of each test case. Experiments were run 5 times, for approximately 30 days, to account for randomness, on the Niagara cluster servers provided by the Digital Research Alliance of Canada (the Alliance). Each server has 40 cores at 2.4 GHz with 202 GB of main memory. The total number of experiments is 320.

3.3.3 Comparison Baselines

Bagherzadeh et al. (2021) applied DRL using state-of-the-art DRL algorithms from the Stable-baselines framework to solve the test case prioritization problem. We use this work as a baseline and compare our suggested DRL strategies with their configurations. Bagherzadeh et al. (2021) also presented the results of three benchmark works: RL-BS1 (Spieker et al. 2017), RL-BS2 (Bertolino et al. 2020), and MART (Bertolino et al. 2020). RL-BS1 applies DRL to simple history datasets. RL-BS2 applies DRL with Shallow Network, Deep Neural Network, and Random Forest implementations on enriched datasets. MART is a supervised learning technique for ranking test cases. RL-BS1 and RL-BS2 report results including runs containing fewer than five test cases, which can inflate APFD and NRPA values when prioritization is not required. MART, as a deep learning technique, has no support for incremental learning (Zhang et al. 2019), which is important for dealing with frequently changing CI environments. We also compare our results with these baselines, i.e., RL-BS1, RL-BS2, and MART.

3.3.4 Training of a DRL Agent

The applicability of the DRL algorithms depends on the action space of the ranking models. The pairwise ranking model has a discrete action space, while the pointwise ranking model has a continuous action space. For the sake of comparison of our selected DRL frameworks (Table 2), DQN and A2C are applied to the pairwise ranking model, while DDPG is applied to the pointwise ranking model.

Regarding the test case prioritization problem, the agent is trained in a simulated environment rather than in the software-production environment, which is the case for many systems, especially safety-critical systems. During testing, the same OpenAI gym interface as for the game environment is used. Nevertheless, after training, the agent can be deployed into a real environment (Dulac-Arnold et al. 2019). We follow the same procedure as Bagherzadeh et al. (2021): the agent is first trained on the available execution history; then, at the end of the cycle, the test cases are ranked and new execution logs are captured; the new logs are used to train the agent at the beginning of the next cycle.

3.3.5 Integration of a DRL Agent into CI Environments

To integrate the DRL agent into CI environments, the agent must first be trained on the execution history of available test cases and the history of test case-related code features (Bagherzadeh et al. 2021). Then, the trained agent is deployed to the production setting where the test case features can be used in each CI cycle to rank the test cases. During the testing process, if accuracy decreases, execution logs are captured and passed to the agent so that it can adapt to the changes.

3.3.6 Datasets

We ran our experiments on the datasets used by Bagherzadeh et al. (2021): simple and enriched historical datasets. Simple historical datasets represent testing situations where the source code is not available; they contain the age, average execution time, and verdicts of test cases. Enriched historical datasets represent testing situations where the source code is available but, due to the time constraints imposed by CI, complete coverage analysis is not possible; they are enriched with history data, execution history, and code characteristics from Apache Commons projects (Bertolino et al. 2020). Table 3 shows the list of datasets that we employ in this study and their characteristics.

Table 3 Datasets (Bagherzadeh et al. 2021)

The execution logs contain up to 438 CI cycles, and each CI cycle includes at least 6 test cases; cycles with fewer than 6 test cases are not relevant and can inflate the accuracy of the results (Bagherzadeh et al. 2021). The logs column indicates the number of test case execution logs, which ranges from 2,207 to 32,118. Enriched datasets show a low rate of failed cycles and a low failure rate, while the failure rates and numbers of failed cycles in simple datasets are high. The last column shows the average computation time of the enriched features per cycle.

3.3.7 Evaluation Metrics

We use two evaluation metrics to assess the accuracy of the prioritization techniques across our DRL configurations; both metrics were used by Bagherzadeh et al. (2021). We describe them in the rest of this section.

Normalized Rank Percentile Average (NRPA)

NRPA measures how close a predicted ranking of items is to the optimal ranking, independently of the context of the problem or the ranking criteria. Its value ranges from 0 to 1. NRPA is defined as \(NRPA=\frac{RPA(s_e)}{RPA(s_o)}\), where \(s_e\) is the ordered sequence generated by a ranking algorithm R that takes a set of k items, and \(s_o\) is the optimal ranking of the items. RPA is defined as:

$$\begin{aligned} RPA(s)= \frac{\sum _{m \in s} \sum _{i=idx(s,m)}^{k} \left( |s| - idx(s_o,m) + 1 \right) }{k^{2}(k+1)/2} \end{aligned}$$
(1)

where \(idx(s,m)\) returns the position of m in sequence s.
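
For illustration, the following minimal sketch transcribes Eq. (1) directly, assuming 1-based positions and that the predicted and optimal rankings contain the same k items; it is not the implementation used in our experiments.

```python
def idx(seq, item):
    return seq.index(item) + 1          # 1-based position of item in the sequence

def rpa(seq, optimal):
    # Eq. (1): the inner sum runs from idx(seq, m) to k, so each item m contributes
    # (k - idx(seq, m) + 1) * (|seq| - idx(optimal, m) + 1).
    k = len(seq)
    total = sum((k - idx(seq, m) + 1) * (len(seq) - idx(optimal, m) + 1) for m in seq)
    return total / (k ** 2 * (k + 1) / 2)

def nrpa(ranking, optimal):
    return rpa(ranking, optimal) / rpa(optimal, optimal)

# The closer the predicted order is to the optimal one, the closer NRPA is to 1.
print(nrpa(["t2", "t1", "t3"], ["t1", "t2", "t3"]))  # about 0.93
```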

Average Percentage of Faults Detected (APFD)

APFD measures the weighted average of the percentage of faults detected by the execution of test cases in a certain order. It ranges from 0 to 1, and values close to 1 imply fast fault detection. It is defined as follows:

$$\begin{aligned} APFD(s_e)=1- \frac{\sum _{t \in s_e} idx(s_e,t)*t.v}{|s_e |*m} + \frac{1}{2*|s_e |} \end{aligned}$$
(2)

where m is the total number of faults, t is a test case in \(s_e\), and t.v is its execution verdict, either 0 or 1.
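
Similarly, the following minimal sketch transcribes Eq. (2), assuming 1-based positions and a verdict of 1 for failing (fault-revealing) test cases; it is an illustration rather than the implementation used in our experiments.

```python
def apfd(ranking, verdicts):
    """APFD of a ranking, where verdicts[t] is 1 if test case t fails and 0 otherwise."""
    n = len(ranking)
    m = sum(verdicts[t] for t in ranking)                      # total number of detected faults
    weighted = sum((i + 1) * verdicts[t] for i, t in enumerate(ranking))
    return 1 - weighted / (n * m) + 1 / (2 * n)

# Executing the failing test case first yields a higher APFD.
verdicts = {"t_fail": 1, "t_pass1": 0, "t_pass2": 0}
print(apfd(["t_fail", "t_pass1", "t_pass2"], verdicts))  # about 0.83
print(apfd(["t_pass1", "t_pass2", "t_fail"], verdicts))  # about 0.17
```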

However, NRPA can be misleading in the presence of failures, as it treats all test cases the same regardless of their execution verdict. Bagherzadeh et al. (2021) show that NRPA values contradict APFD values for some datasets, and therefore recommend using the APFD metric to measure how well a given ranking reveals faults early. Both the APFD and NRPA metrics are suitable to measure the accuracy of the DRL ranking strategies and are calculated during testing, after the agent is trained.

Training time

We collect the time consumed by the DRL agents to train their policy, which lasts for 200 episodes, for the pairwise and pointwise strategies.

Prediction time

For both pointwise and pairwise ranking models, we measured the time consumed by the DRL agents to rank a set of test cases.

3.3.8 Analysis Method

To answer RQ1, we conducted experiments and collected the averages and standard deviations of APFD and NRPA for the eight datasets (see Subsection 3.3.6), as well as their training and prediction times, using DRL algorithms from the selected frameworks. We relied on the implementations of the algorithms provided by the Stable-baselines3 (Raffin et al. 2021), Keras-rl (Plappert 2016), and Tensorforce (Schaarschmidt et al. 2018) frameworks. Furthermore, we collected from the study of Bagherzadeh et al. (2021) the averages and standard deviations of the baseline configurations in terms of NRPA and APFD values. For each framework, we compare its best configuration with the baselines in terms of NRPA or APFD. We calculate the Common Language Effect Size (CLES) (McGraw and Wong 1992; Arcuri and Briand 2014) between the best configuration of each framework and the baselines to assess the effect size of the differences. CLES estimates the probability that a randomly sampled value from one population is greater than a randomly sampled value from another population. In RQ2, we use Welch's ANOVA and the Games-Howell post-hoc test (Welch 1947; Games and Howell 1976) to identify the best DRL algorithm. All configurations across all cycles are compared using one NRPA or APFD value per cycle. As for the game testing problem, a difference with p-value \(<= 0.05\) is considered significant in our assessments. In RQ3, we investigate how the same algorithm performs, on average, across different DRL frameworks over multiple runs of testing. Specifically, the performance of agents trained with the same algorithm across different DRL frameworks over multiple runs is evaluated based on metrics such as the NRPA, the APFD, and the training and prediction times collected in RQ1.
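
As an illustration of the CLES computation, the following sketch brute-forces the pairwise comparison between two samples, counting ties as one half, per the definition above; it is an example, not necessarily the implementation we used.

```python
def cles(sample_a, sample_b):
    """Probability that a random value from sample_a is greater than one from sample_b."""
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in sample_a for b in sample_b)
    return wins / (len(sample_a) * len(sample_b))

# Hypothetical APFD values of two configurations over 5 runs: every pair favors the first one.
print(cles([0.82, 0.79, 0.85, 0.81, 0.80], [0.76, 0.78, 0.74, 0.77, 0.75]))  # 1.0
```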

3.4 Data Availability

The source code of our implementation and the results of experiments are publicly available (Replication package 2022).

4 Experimental Results

We now report the results of our experiments.

4.1 Game Testing

RQ1:

Figures 3 and 4 show, respectively, the average number of detected bugs and the average cumulative reward obtained by the DQN algorithms from the Stable-baselines3 (DQN_SB), Keras-rl (DQN_KR), and Tensorforce (DQN_TF) frameworks.

Fig. 3 Number of bugs detected by DQN agents from different frameworks

Fig. 4 Average cumulative reward earned by DQN agents from different frameworks

In Figs. 3 and 4, the x-axis represents the 4 million steps testing budget. In Fig. 3, the y-axis is the average number of bugs detected over 10 runs of the algorithm. In Fig. 4, the y-axis is the average cumulative reward obtained by the DRL strategy over 10 runs of the algorithm. Among the DQN algorithms, Stable-baselines performs better in terms of detecting bugs, and Tensorforce performs better in terms of cumulative reward. Our intuition is that these results are explained by the diversity of the hyperparameters provided by each DRL framework, which affects performance, as well as by the difference between TensorFlow and Pytorch as the backends of the frameworks.

Figures 5 and 6 show, respectively, the average number of bugs and the average cumulative reward obtained by the A2C algorithm from Stable-baselines3 (A2C_SB), Tensorforce (A2C_TF), and wuji (Zheng et al. 2019) (A2C_wuji) for a testing time of 4 million steps.

Fig. 5 Number of bugs detected by A2C agents from different frameworks

Fig. 6 Average cumulative reward earned by A2C agents from different frameworks

Recall that in this study, given that we compare DRL algorithms, we compare our results with the number of bugs detected by only the DRL part of wuji. Since the authors of wuji did not consider the average cumulative reward as a metric in the original work, we do not report it for wuji here, as we would not have any baseline to compare the results with. A2C_SB performs better than A2C_wuji and A2C_TF in terms of detecting bugs. In terms of rewards earned, the A2C algorithm from Stable-baselines3 also performs better on average, as it detects more bugs of Type 1 (see Table 5).

Figures 7 and 8 show respectively the average number of detected bugs and average cumulative reward obtained by the PPO algorithms from Stable-baselines3 (PPO_SB) and Tensorforce (PPO_TF) frameworks.

Fig. 7 Number of bugs detected by PPO agents from different frameworks

Fig. 8 Average cumulative reward earned by PPO agents from different frameworks

PPO_SB performs slightly (4.69%) better than PPO_TF in terms of bugs detected. Similarly, PPO_SB performs better on average in terms of rewards earned.

Figure 9 shows the statistical results of the number of bugs discovered by all the studied DRL configurations.

Fig. 9 The number of bugs discovered using different strategies after 4 million steps for Block Maze

The A2C implementation of wuji (Zheng et al. 2019) detects 19% fewer bugs than A2C_SB, A2C_TF, PPO_TF, and PPO_SB after 4 million steps of testing. Among the studied DQN strategies, DQN_SB, DQN_KR, and DQN_TF detect respectively 88%, 92%, and 98% fewer bugs than the A2C implementation of wuji at the same number of steps. The A2C algorithm combines the benefits of value-based (like DQN) and policy-based DRL algorithms, which explains why it detects more bugs than the DQN algorithm.

To assess the bug detection process, we compute the bug detection rate as the ratio of the number of detected bugs to the total number of bugs. Tables 4 and 5 report the detection rate (in percentage) of the DRL strategies per type of bug, as defined in Subsection 3.2.4, for each testing budget (4 million steps and 10,000 steps). The results show that the DQN strategies detect Type 2 bugs more effectively, while the PPO strategies detect Type 1 bugs more effectively.

Table 4 Detection rate (in percentage) of DQNs per type of bugs (values in bold indicate the best rate for each DRL strategy per each testing budget)
Table 5 Detection rate (in percentage) of A2C and PPO per type of bugs (values in bold indicate the best rate for each DRL strategy per each testing budget)

We also analyze the line coverage obtained by each DRL strategy, as well as their state coverage on the Block Maze game. The line coverage is exactly the same for all strategies: 96%. The remaining 4% mostly corresponds to code that is only executed when the player reaches the goal of the Block Maze, which never happens: specifically, the lines of code in the Block Maze gym environment that check whether the player is at the goal location, and the lines of code instructing the termination of the game when a player reaches the goal. Finally, the line of code in the Block Maze gym environment that converts the maze to an RGB image is not reached either, as we do not require it during testing. The Block Maze has a total of 400 potential states to be visited by the DRL agent. Table 6 shows the state coverage obtained by the DRL algorithms from the frameworks we have evaluated.

As expected, A2C_SB, A2C_TF, PPO_TF, and PPO_SB have the largest state coverage, as they are able to detect more bugs. The state coverage obtained by the DQN strategies is lower, as they detect fewer bugs, although among them the Stable-baselines framework still has the best performance. Moreover, the bugs that are not detected are explained by the fact that the DRL configurations are not able to cover the whole observation state space. The code is relatively easy to cover, as opposed to the states. Thus, detecting bugs by maximizing the state coverage could lead to better performance for game testing.

In terms of winning the game (i.e., reaching the goal position of the Block Maze as illustrated in Fig. 2), none of our strategies is successful. Our results in Fig. 10 show that the DRL agents earn negative rewards for all steps during testing.

Table 6 State coverage of DRL algorithms on the Block Maze game
Fig. 10 Average cumulative reward obtained by different DRL algorithms after 4 million steps for Block Maze

For a richer analysis, and to answer RQ1, we also collected our evaluation metrics with a reduced number of steps at test time (a 10,000 steps budget instead of 4 million). This analysis does not involve A2C_wuji, as with this implementation the detection of bugs only starts after 300,000+ steps. Figures 11 and 12 show respectively the average number of detected bugs and the average cumulative reward, over 10 runs, obtained by the DQN algorithms from the Stable-baselines3 (DQN_SB), Keras-rl (DQN_KR), and Tensorforce (DQN_TF) frameworks on a 10,000 steps budget.

Fig. 11 Number of bugs detected by DQN agents from different frameworks on a 10k steps budget

Fig. 12 Average cumulative reward earned by DQN agents from different frameworks on a 10k steps budget

Similarly, Figs. 13 and 14 show respectively the average number of detected bugs and the average cumulative reward, over 10 runs, obtained by the A2C algorithms from the Stable-baselines3 (A2C_SB) and Tensorforce (A2C_TF) frameworks on a 10,000 steps budget.

Fig. 13 Number of bugs detected by A2C agents from different frameworks on a 10k budget

Fig. 14 Average cumulative reward earned by A2C agents from different frameworks on a 10k budget

Finally, Figs. 15 and 16 show respectively the average number of detected bugs and the average cumulative reward, over 10 runs, obtained by the PPO algorithms from the Stable-baselines3 (PPO_SB) and Tensorforce (PPO_TF) frameworks on a 10,000 steps budget.

Fig. 15 Number of bugs detected by PPO agents from different frameworks on a 10k budget

Fig. 16 Average cumulative reward earned by PPO agents from different frameworks on a 10k budget

Consistently, among the DQN algorithms, Stable-baselines3 performs best in terms of detecting bugs and Tensorforce performs best in terms of rewards earned. Among the A2C algorithms, on average, Tensorforce performs best in terms of detecting bugs and Stable-baselines3 performs best in terms of rewards earned. Among the PPO algorithms, Stable-baselines3 performs best in terms of detecting bugs and Tensorforce performs best in terms of rewards earned. As with the 4 million steps budget, none of the DRL configurations is able to win the game.


In terms of training and prediction time, Tables 7 and 8 show the results of Welch’s ANOVA test and the CLES values for each of the DRL algorithms. In terms of prediction time, among the DQN algorithms, the Keras-rl and Stable-baselines3 frameworks have the best performance, while among the PPO and A2C algorithms, Stable-baselines3 has the best results. In terms of training time, among the PPO and DQN algorithms, Stable-baselines3 has the best results; among the A2C algorithms, Tensorforce has the best results.

In terms of state coverage, Table 9 shows the results obtained by the DRL configurations on a 10,000 steps budget.

As with the 4 million steps budget, PPO_TF, PPO_SB, A2C_SB, and A2C_TF have the largest state coverage. Nevertheless, as expected with the smaller budget, all DRL configurations have lower state coverage. In terms of line coverage, all DRL configurations achieve the same 96% line coverage as with the 4 million steps budget.


RQ2:

We perform Welch’s ANOVA and Games-Howell post-hoc test to check for significant differences between our results. Tables 10 and 11 show respectively the results of Welch’s ANOVA and Games-Howell post-hoc test analysis in terms of average cumulative reward earned and number of bugs detected by the DRL algorithms.

Tables 10 and 11 also report the CLES between the DRL configurations. CLES values indicate the probability that one configuration detects more bugs, or earns more rewards, than another. Table 10 shows that the A2C and PPO algorithms earn significantly more rewards than the DQN algorithms. Table 11 shows that, on average, the A2C and PPO algorithms perform better than the DQN algorithms, with a high number of detected bugs (between 12.23 and 15.28) and CLES values equal to 1. The A2C algorithms have similar performance among themselves, and so do the PPO algorithms: while the PPOs detect more bugs, the difference with the A2Cs is not statistically significant. Similarly, Tables 12 and 13 show respectively the results of Welch’s ANOVA and the post-hoc tests regarding the bugs detected by the DRL algorithms and the rewards earned on a 10,000 steps budget.
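For reference, this kind of analysis can be sketched with the pingouin library as follows; the long-format DataFrame layout, the file name, and the column names are assumptions of this example rather than our actual artifacts.

# Sketch of the statistical analysis, assuming a long-format DataFrame with one
# row per run, a 'config' column (e.g., 'A2C_SB', 'DQN_KR') and a 'bugs' column.
import pandas as pd
import pingouin as pg

df = pd.read_csv("bugs_per_run.csv")  # hypothetical results file

# Welch's ANOVA: does at least one configuration differ significantly?
print(pg.welch_anova(data=df, dv="bugs", between="config"))

# Games-Howell post-hoc test: which pairs of configurations differ?
print(pg.pairwise_gameshowell(data=df, dv="bugs", between="config"))

# Common Language Effect Size between two configurations of interest.
a2c_sb = df.loc[df["config"] == "A2C_SB", "bugs"]
dqn_kr = df.loc[df["config"] == "DQN_KR", "bugs"]
print(pg.compute_effsize(a2c_sb, dqn_kr, eftype="CLES"))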

Table 7 Results of Welch’s ANOVA test of prediction time (in milliseconds) of DRL configurations (in bold are DRL configurations where p-value is < 0.05 and have greater performance w.r.t the effect size)
Table 8 Results of Welch’s ANOVA test of training time (in milliseconds) of DRL configurations on a 10k steps budget (in bold are DRL configurations where p-value is < 0.05 and have greater performance w.r.t the effect size)
Table 9 State coverage of DRL algorithms on the Block Maze game on a 10K steps budget
Table 10 Results of Welch’s ANOVA and Games-Howell post-hoc test regarding the average cumulative reward earned by DRL algorithms (in bold are DRL configurations where p-value is < 0.05 and have greater performance w.r.t the effect size)
Table 11 Results of Welch’s ANOVA and Games-Howell post-hoc test regarding the number bugs detected by DRL algorithms (in bold are DRL configurations where p-value is < 0.05 and have greater performance w.r.t the effect size)
Table 12 Results of Welch’s ANOVA and Games-Howell post-hoc test regarding the number of bugs detected by DRL algorithms on a 10k steps budget (in bold are DRL configurations where p-value is < 0.05 and have greater performance w.r.t the effect size)
Table 13 Results of Welch’s ANOVA and Games-Howell post-hoc test regarding the average cumulative reward on a 10k steps budget (in bold are DRL configurations where p-value is < 0.05 and have greater performance w.r.t the effect size)

As with the 4 million steps budget, the A2C and PPO algorithms earned significantly more rewards than the DQN algorithms (see Table 13). In terms of number of bugs detected, Table 12 shows that the A2Cs detect fewer bugs than the PPO algorithms, with CLES values between 0.18 and 0.48. The following items summarize our results per DRL algorithm, where > denotes a greater number of detected bugs and only pairs with CLES values greater than 60% are listed:

A2C Algorithms:

  • A2C_SB >A2C_TF > A2C_wuji

PPO Algorithms:

  • PPO_SB > PPO_TF

DQN Algorithms:

  • DQN_SB > DQN_TF

  • DQN_KR > DQN_TF

In terms of average cumulative reward, the following summarizes our results per DRL algorithm where CLES values are greater than 60%.

A2C Algorithms:

  • A2C_SB > A2C_TF

PPO Algorithms:

  • PPO_SB > PPO_TF

The following are the results per DRL algorithm on a 10,000 steps budget, where > denotes a greater number of detected bugs and CLES values are greater than 60%.

PPO Algorithms:

  • PPO_SB > PPO_TF

DQN Algorithms:

  • DQN_SB > DQN_TF

In terms of average cumulative reward, the following summarizes our results per DRL algorithm on a 10,000 steps budget, where CLES values are greater than 60%.

A2C Algorithms:

  • A2C_SB > A2C_TF

DQN Algorithms:

  • DQN_KR > DQN_SB

Moreover, on the basis of a 4 million steps budget, we observe with CLES values equal to 1 that the A2C and PPO algorithms detect more bugs than the DQN algorithms. Similarly, on the basis of a 10,000 steps budget, we observe with CLES values greater than 90% that the A2C and PPO algorithms detect more bugs than the DQN algorithms. Practically, this means that in at least 90% of the episodes, the PPO algorithms detect more bugs.


RQ3:

Our findings show that the same DRL algorithm does not yield similar results across the DRL frameworks. We explain this by the fact that the DRL frameworks used in this study do not provide the same hyperparameters for a given DRL algorithm: some of the hyperparameters are similar, but not all of them. For example, the DQN algorithm from Stable-baselines has an additional hyperparameter called “gradient_steps”, which controls how many gradient updates are performed for each rollout, rather than performing a single update only after a complete rollout is done. These additional hyperparameters, even with default values, can slightly improve efficiency, as we observe in our results.
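As an illustration of how this hyperparameter is exposed, the following sketch instantiates the Stable-baselines3 DQN with a non-default gradient_steps value; the environment and the chosen values are illustrative placeholders, not our experimental settings.

# Sketch: the `gradient_steps` hyperparameter of Stable-baselines3's DQN.
# The environment and the hyperparameter values are illustrative placeholders.
import gymnasium as gym  # older Stable-baselines3 versions use `gym` instead
from stable_baselines3 import DQN

env = gym.make("CartPole-v1")
model = DQN(
    "MlpPolicy",
    env,
    train_freq=4,       # collect 4 environment steps per rollout
    gradient_steps=4,   # then perform 4 gradient updates (default is 1)
    verbose=0,
)
model.learn(total_timesteps=10_000)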

Table 14 The average performance of different configurations in terms of APFD and NRPA, along with the results of the three baselines (RL-BS1, RL-BS2, and MART) for PAINT, IOFROL, CODEC, and IMAG datasets. The index in each cell shows the position of a configuration (row) with respect to others for each dataset (column) in terms of NRPA or APFD, based on statistical testing

4.2 Test Case Prioritization

Tables 14 and 15 show the averages and standard deviations of APFD and NRPA for the eight datasets, using different configurations (i.e., combinations of ranking model, DRL framework, and algorithm). The first column reports the DRL algorithms and the second column the ranking models, followed by four datasets per table (a total of eight datasets). Each dataset column is subdivided into the DRL frameworks. In the rest of this section, we use [ranking model]-[RL algorithm]-[RL framework] to refer to DRL configurations. For example, Pairwise-DQN-KR corresponds to the configuration combining the pairwise ranking model and the DQN algorithm from the Keras-rl framework. For each dataset (column), the relative performance rank of each configuration in terms of APFD or NRPA is expressed with an index in each cell, where a lower rank indicates better performance. Again, we analyze the differences in the results using Welch’s ANOVA and the Games-Howell post-hoc test.
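As a reminder of how the accuracy metric is computed, the following is a minimal sketch of the standard APFD formula; the fault-matrix representation is an assumption made for this example, and NRPA is computed analogously, as the ratio between the RPA of the produced ranking and that of the optimal ranking.

# Sketch of the standard APFD computation for a prioritized test suite.
# `order` is the prioritized list of test-case ids; `faults` maps each fault
# to the set of test cases that detect it (a hypothetical representation).
def apfd(order, faults):
    n, m = len(order), len(faults)
    position = {test_id: rank for rank, test_id in enumerate(order, start=1)}
    # TF_i: 1-based position of the first test case that reveals fault i.
    tf_sum = sum(min(position[t] for t in detecting) for detecting in faults.values())
    return 1.0 - tf_sum / (n * m) + 1.0 / (2 * n)

# Example: 4 test cases and 2 faults.
print(apfd(["t3", "t1", "t4", "t2"], {"f1": {"t1"}, "f2": {"t3", "t2"}}))  # 0.75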

Table 15 The average performance of different configurations in terms of APFD and NRPA, along with the results of the three baselines (RL-BS1, RL-BS2, and MART) for IO, COMP, LANG, and MATH datasets. The index in each cell shows the position of a configuration (row) with respect to others for each dataset (column) in terms of NRPA or APFD, based on statistical testing

Tables 16 and 17 show the overall training times for the first 10 cycles across datasets. Similarly, Tables 18 and 19 show the averages and standard deviations of the prediction (ranking) time for the first 10 cycles across datasets. Each cell value represents a configuration, as mentioned before. For each dataset, the relative performance rank of each configuration in terms of training/prediction time is expressed with an index, where a lower rank indicates better performance.

Table 16 Average training time (in minutes) of DRL configurations for the first 10 cycles across PAINT, IOFROL, CODEC, and IMAG datasets
Table 17 Average training time (in minutes) of DRL configurations for the first 10 cycles across IO, COMP, LANG, and MATH datasets
Table 18 The average of prediction (ranking) time (in seconds) of DRL configurations for the first 10 cycles across PAINT, IOFROL, CODEC, and IMAG datasets
Table 19 The average of prediction (ranking) time (in seconds) of DRL configurations for the first 10 cycles across IO, COMP, LANG, and MATH datasets

RQ1: As shown in Table 14, pairwise configurations perform best across Stable-baselines’ algorithms. Pairwise-A2C-SB yields the best averages and, based on the post-hoc test, performs best across all datasets. Similarly, the Stable-baselines framework performs best with the pointwise ranking model. While Pairwise-A2C-SB has the best performance overall, Tensorforce performs well on the IOFROL dataset with the Pairwise-A2C configuration. IOFROL is a simple dataset with a high number of execution logs; on this dataset, the training time of the DRL agent is long, which might explain why Tensorforce configurations perform well. In other words, despite the high number of execution logs of the IOFROL dataset, Tensorforce still achieves good performance.

To show the importance of selecting the best DRL configuration, we measured the effect size of the differences between pairs of configurations based on CLES. As shown in Table 20, the CLES values between one of the worst and the best configurations are over 80% for the six enriched datasets, whereas they are 66% and 71% for the simple Paint-Control and IOFROL datasets, respectively. These results show that, for each dataset, there is, with high probability, a DRL configuration that has adequately learned a ranking strategy.

Table 20 Common Language Effect Size between one of the worst and best configurations for each dataset based on accuracy

In terms of training time, as shown in Tables 16 and 17, both pairwise and pointwise configurations perform well for some datasets/frameworks. Figures 17 and 18 show the statistical analysis of the training time involving Pairwise-DQN and Pointwise-DDPG configurations, respectively.

The results show that Pointwise-DDPG-SB performs best followed by Pointwise-DDPG-KR. Regarding the DQN configurations, similarly, the Stable-baselines framework performs best. It is worth mentioning that, since DRL agents are trained offline, the training time does not add any delay to the CI build process.

In terms of prediction time, as shown in Tables 18 and 19, similar to the training time, both kinds of configurations (pairwise and pointwise) perform well for some of the datasets/frameworks. Based on the post-hoc test, Pairwise-DQN-SB performs best on average, followed by Pairwise-DQN-KR. The prediction time among pointwise and pairwise configurations goes up to 11 s, notably for Pairwise-DQN-TF, which is non-negligible for CI builds.

The last three rows of Tables 14 and 15 show the averages and standard deviations of the baseline configurations in terms of NRPA and APFD values, collected from Bagherzadeh et al. (2021), for the datasets on which they were originally experimented. Tables 21, 22, and 23 show the CLES between the best configuration of each framework and the selected baselines for all datasets, to assess the effect size of the differences.

Fig. 17 Training time of the Pairwise-DQN configuration across DRL frameworks for enriched datasets

Fig. 18 Training time of the Pointwise-DDPG configuration across DRL frameworks for enriched datasets

Table 21 Common Language Effect Size between Pairwise-A2C-SB and selected baselines
Table 22 Common Language Effect Size between Pairwise-DQN-KR and selected baselines
Table 23 Common Language Effect Size between Pairwise-A2C-TF and selected baselines

The row RL-BS1 in Tables 14 and 15 shows the results of an RL-based solution reported by Bagherzadeh et al. (2021). For the Paint-Control dataset, Pairwise-A2C-SB fares slightly better than RL-BS1, with a CLES of 60.2. Also, both solutions (RL-BS1, Pairwise-A2C-SB) are close to the optimal ranking (the row labeled “Optimal” in Tables 14 and 15). For the IOFROL dataset, RL-BS1 performs better than Pairwise-A2C-SB; however, neither solution performs well, as their values are lower than the optimal ranking. RL-BS1 performs better than Pairwise-DQN-KR for both simple datasets. Moreover, RL-BS1 and Pairwise-A2C-TF perform equivalently on the IOFROL dataset, as shown by the CLES values reported in Tables 22 and 23. These results nevertheless remain lower than the optimal ranking. As pointed out by Bagherzadeh et al. (2021), the test execution history provided by the simple datasets is not sufficient to learn an accurate test prioritization policy.

The row RL-BS2 in Tables 14 and 15 shows the results of a second RL-based solution reported by Bagherzadeh et al. (2021). For all datasets, Pairwise-A2C-SB fares significantly better than RL-BS2, with CLES values between 71.7 and 91.0, as shown in Table 21. In contrast, RL-BS2 performs better than Pairwise-A2C-TF and Pairwise-DQN-KR for all datasets: CLES values between Pairwise-A2C-TF and RL-BS2 range between 16.3 and 33.5, and between 23.8 and 34.8 for Pairwise-DQN-KR and RL-BS2. Thus, according to these results, Pairwise-A2C-SB improves over the baselines in the use of DRL for test case prioritization.

The row labeled MART (the MART ranking model) in Tables 14 and 15 provides the results of the best ML-based solution reported by Bagherzadeh et al. (2021). For the MATH dataset, Pairwise-A2C-SB performs equivalently to MART, with a CLES value of 58.8. For the other datasets, Pairwise-A2C-SB fares better than MART. The CLES of Pairwise-A2C-SB vs. MART ranges from 58.8 to 85.7, with an average of 0.711, i.e., in \(71.1\%\) of the cycles, Pairwise-A2C-SB fares better than MART. We can therefore conclude that Pairwise-A2C-SB advances the state of the art compared to the best ML-based ranking technique (MART). The Pairwise-A2C-TF and Pairwise-DQN-KR solutions perform similarly to MART, with CLES averages of 0.549 and 0.584, respectively.


RQ2: Fig. 19 shows the statistical results of APFD and NRPA metrics for the Pairwise-DQN configuration.

Fig. 19 APFD (simple datasets) or NRPA (enriched datasets) of the DQN-PAIRWISE configuration across DRL frameworks for all datasets: Stable-baselines vs. Keras-rl (left) and Stable-baselines vs. Tensorforce (right)

The results show that the Stable-baselines framework performs better for all enriched datasets. Similarly, Fig. 20 shows the statistical results of APFD and NRPA metrics regarding the DDPG-Pointwise configuration.

Fig. 20 APFD (simple datasets) or NRPA (enriched datasets) of the DDPG-POINTWISE configuration across DRL frameworks for all datasets: Stable-baselines vs. Keras-rl (left) and Stable-baselines vs. Tensorforce (right)

According to the reported results, Stable-baselines performs best. Moreover, to analyze the accuracy of the DRL algorithms w.r.t. their relative performance, we performed two sets of Welch’s ANOVA and Games-Howell post-hoc tests corresponding to the pairwise and pointwise ranking models, based on the results of all algorithms across datasets. Tables 24 and 25 show the calculated mean, p-value, and CLES for each configuration.

Table 24 Results of Welch ANOVA and Games-Howell post-hoc tests on pairwise and pointwise ranking models for enriched datasets (in bold are DRL configurations where p-value is < 0.05 and have greater performance w.r.t the effect size)
Table 25 Results of Welch ANOVA and Games-Howell post-hoc tests on pairwise and pointwise ranking models for simple datasets (in bold are DRL configurations where p-value is < 0.05 and have greater performance w.r.t the effect size)

The results show that, for the enriched datasets, A2C-SB performs better on both the pairwise and pointwise ranking models. Regarding the simple datasets, none of the DRL configurations has learned an adequate ranking strategy, as the highest CLES value is 0.63. This is explained by the fact that it is not always possible to learn a proper policy from simple data.

To compare the DRL configurations based on their training time, we performed two sets of Welch’s ANOVA and Games-Howell post-hoc tests corresponding to the pairwise and pointwise ranking models, based on the results of all algorithms across datasets for the first 10 cycles. The results are reported in Tables 26 and 27.

Table 26 Results of Welch ANOVA and Games-Howell post-hoc tests of training time (in milliseconds) on pairwise and pointwise ranking models for enriched datasets (in bold are DRL configurations where p-value is < 0.05 and have greater performance w.r.t the effect size)
Table 27 Results of Welch ANOVA and Games-Howell post-hoc tests of training time (in milliseconds) on pairwise and pointwise ranking models for simple datasets (in bold are DRL configurations where p-value is < 0.05 and have greater performance w.r.t the effect size)

We summarize the results as follows, where > denotes better performance in terms of training time:

  • Pairwise and simple datasets:

    • DQN-SB > DQN-KR > DQN-TF

    • A2C-SB > A2C-TF

  • Pairwise and enriched datasets:

    • DQN-SB > DQN-KR > DQN-TF

    • A2C-SB > A2C-TF

    • PPO-SB > PPO-TF

  • Pointwise and simple datasets:

    • DDPG-SB > DDPG-KR > DDPG-TF

    • A2C-SB > A2C-TF

    • PPO-SB > PPO-TF

  • Pointwise and enriched datasets:

    • DDPG-KR > DDPG-SB > DDPG-TF

    • A2C-SB > A2C-TF

    • PPO-SB > PPO-TF

To compare the DRL configurations based on their prediction time, we again performed two sets of Welch’s ANOVA and Games-Howell post-hoc tests corresponding to the pairwise and pointwise ranking models. The results are reported in Tables 28 and 29.

Table 28 Results of Welch ANOVA and Games-Howell post-hoc tests of testing time (in milliseconds) on pairwise and pointwise ranking models for enriched datasets (in bold are DRL configurations where p-value is < 0.05 and have greater performance w.r.t the effect size)
Table 29 Results of Welch ANOVA and Games-Howell post-hoc tests of testing time (in milliseconds) on pairwise and pointwise ranking models for simple datasets (in bold are DRL configurations where p-value is < 0.05 and have greater performance w.r.t the effect size)

The following summarizes the results for the first 10 cycles, where > denotes better performance in terms of prediction time:

  • Pairwise and simple datasets:

    • DQN-KR > DQN-SB > DQN-TF

    • A2C-SB > A2C-TF

    • PPO-SB > PPO-TF

  • Pairwise and enriched datasets:

    • DQN-KR > DQN-SB > DQN-TF

    • A2C-SB > A2C-TF

    • PPO-SB > PPO-TF

  • Pointwise and simple datasets:

    • DDPG-KR > DDPG-SB > DDPG-TF

    • A2C-SB > A2C-TF

    • PPO-SB > PPO-TF

  • Pointwise and enriched datasets:

    • DDPG-KR > DDPG-SB > DDPG-TF

    • A2C-SB > A2C-TF

    • PPO-SB > PPO-TF

Based on the presented results, we can conclude that both pairwise and pointwise configurations perform well with the Stable-baselines and Keras-rl frameworks in terms of prediction time, whereas Tensorforce configurations require more time for both training and prediction.


RQ3: Figures 21, 22, and 23 show the results of the Pairwise-DQN configurations from the Tensorforce and Keras-rl frameworks in terms of accumulated reward obtained by the agents during training, accumulated reward obtained during testing, and NRPA on the CODEC dataset.

Fig. 21 Accumulated reward during training of the Pairwise-DQN configurations for the first 10 CI cycles on the CODEC dataset

Fig. 22 Accumulated reward during testing of the Pairwise-DQN configurations for the first 10 CI cycles on the CODEC dataset

Fig. 23 NRPA of the Pairwise-DQN configurations for the first 10 CI cycles on the CODEC dataset

The results are collected over the first 10 CI cycles and 5 different runs. Regarding the DQN algorithm, Keras-rl and Tensorforce have the same performance in terms of reward but perform differently in terms of NRPA. As with the other DRL algorithms, we do not observe stable results across the DRL frameworks.

5 Recommendations About Frameworks/Algorithms Selection

In this section, we discuss our recommendations regarding the selection of DRL frameworks/algorithms for researchers and practitioners. To derive some of the recommendations below and to investigate which hyperparameters are the most critical for the game testing problem, we conducted manual hyperparameter tuning. Since the goal was not to find the best hyperparameters for each DRL algorithm in each DRL framework, we did not use automatic hyperparameter tuning.

The results of our analysis indicate that there are differences in using the same algorithm from different DRL frameworks. This is due to the diversity of hyperparameters offered by the different DRL frameworks. Among the studied DRL frameworks, the DQN algorithm from Keras-rl has the smallest number of hyperparameters (13 in total), leading to less flexibility in improving the agent’s training process, which explains its poor performance. Moreover, Table 30 shows the results of tuning some of the hyperparameters provided by DQN-KR.

Table 30 Results of State coverage and the number of bugs detected, performed by DQN-KR configuration for the game testing problem on a 10k steps budget over 5 runs

In bold are the values of the hyperparameters we initially used in our experiments. We then vary each of them individually (see column “Values” for their values) and collect the average number of bugs and state coverage. The results show that fine-tuning the hyperparameters does not make DQN-KR significantly more performant; DQN-SB still has better performance. A DRL framework should therefore offer a large number of hyperparameters to provide flexibility for tuning DRL agents and improving their efficiency.
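The following sketch illustrates this kind of one-at-a-time manual tuning for a Keras-rl DQN agent, assuming the keras-rl2 fork compatible with tf.keras; the environment, network architecture, and swept values are illustrative assumptions rather than our exact experimental settings.

# Sketch of one-at-a-time manual tuning of a Keras-rl DQN agent.
# Environment, network size, and swept values are illustrative assumptions.
import gym
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Flatten
from tensorflow.keras.optimizers import Adam
from rl.agents.dqn import DQNAgent
from rl.policy import EpsGreedyQPolicy
from rl.memory import SequentialMemory

def build_agent(env, target_model_update, gamma):
    nb_actions = env.action_space.n
    model = Sequential([
        Flatten(input_shape=(1,) + env.observation_space.shape),
        Dense(64, activation="relu"),
        Dense(nb_actions, activation="linear"),
    ])
    agent = DQNAgent(
        model=model,
        nb_actions=nb_actions,
        memory=SequentialMemory(limit=50_000, window_length=1),
        policy=EpsGreedyQPolicy(),
        target_model_update=target_model_update,
        gamma=gamma,
        nb_steps_warmup=100,
    )
    agent.compile(Adam(learning_rate=1e-3), metrics=["mae"])
    return agent

env = gym.make("CartPole-v1")
# Vary one hyperparameter at a time while keeping the others fixed.
for target_model_update in (1e-2, 1e-3, 500):
    agent = build_agent(env, target_model_update=target_model_update, gamma=0.99)
    agent.fit(env, nb_steps=10_000, verbose=0)
    # In our setting, bug counts and state coverage would be collected here.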


In this paper, we studied two problems whose characteristics are summarized in Table 31.

Table 31 General characteristics of the studied DRL environments

Regardless of the studied frameworks, the PPO and A2C algorithms have shown good performance when applied to the game testing problem. PPO has shown slightly better performance, detecting 1 to 2 more bugs than A2C in the Block Maze game. Regarding the studied test case prioritization problem, Pairwise-A2C-SB yields the best performance. The implementations of the PPO and A2C algorithms show good performance on discrete action spaces.

The studied problems are implemented using both kinds of reward distribution (see Table 31). In the game testing problem, the agent is positively rewarded only when it reaches the goal; otherwise, it receives small negative rewards. The results have shown that this kind of reward does not incentivize the agent to reach the goal, regardless of the DRL implementation applied. In the test case prioritization problem, the agent is positively rewarded with small values even when it fails to rank test cases correctly. Some DRL configurations have performed very well in terms of APFD or NRPA, close to the optimal value.
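For instance, a sparse reward of the kind used in the game testing problem can be sketched as follows; the numeric values are illustrative assumptions, not the exact rewards of the Block Maze environment.

# Hedged sketch of the sparse reward scheme described above; the numeric
# values are illustrative and not the exact rewards of the Block Maze env.
def block_maze_reward(agent_position, goal_position):
    if agent_position == goal_position:
        return 10.0   # large positive reward only when the goal is reached
    return -0.1       # small negative reward for every other step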


To showcase the difference between employing a simple DRL algorithm (DQN) from the two frameworks, we performed an additional analysis of the hyperparameters offered by Stable-baselines and Tensorforce for this algorithm and conducted further experiments. Our findings are as follows:

  • Stable-baselines3 provides a total of 25 hyperparameters while Tensorforce provides 22 hyperparameters.

  • Table 32 describes the hyperparameters of Stable-baselines3 and Tensorforce that differ from each other. An interesting hyperparameter is variable noise from Tensorforce, which adds Gaussian noise (Such et al. 2017) to all trainable variables as an exploration strategy. Adding noise to DRL agents during training has been shown to improve their exploration of the environment and the reward they gain throughout training (Fortunato et al. 2017).

  • We consider variable noise = 0.5 as an additional hyperparameter for the DQN algorithm from Tensorforce (a minimal sketch of enabling this option is given after this list). We then collected the number of detected bugs and the average reward of the DQN agent from Tensorforce over 50,000 training steps on the Block Maze game. Figures 24 and 25 show that the DQN agent from Tensorforce is able to detect more bugs than initially (see Figs. 3 and 4), with a higher gained reward.

  • Furthermore, we conducted more experiments to assess the effect of hyperparameter tuning on the DQN-TF implementation for the game testing problem. Table 33 shows the number of bugs and state coverage resulting from this hyperparameter tuning. As in the other results, in bold are the values of the hyperparameters we initially used in our experiments; we then vary each of them individually (see column “Value” for their values). As shown in Table 33, the only parameter that stands out is variable noise, which boosts DQN-TF performance. Such results indicate that a DRL framework with effective exploration strategies could improve agent performance.

  • Similarly, Tables 34 and 35 show the results of hyperparameter tuning for the PPO-TF and A2C-TF implementations. The results in these tables show up to 2 more bugs detected when applying different values of the hyperparameters that the Tensorforce framework offers (i.e., variable noise, discount factor, entropy/l2 regularization, and exploration).
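Below is a minimal sketch of enabling this exploration option in Tensorforce; the environment and the remaining hyperparameter values are illustrative placeholders rather than our experimental settings.

# Sketch: enabling Gaussian parameter noise via Tensorforce's `variable_noise`.
# Environment and the other hyperparameter values are illustrative placeholders.
from tensorforce import Agent, Environment

environment = Environment.create(
    environment="gym", level="CartPole-v1", max_episode_timesteps=500
)
agent = Agent.create(
    agent="dqn",
    environment=environment,
    memory=10_000,
    batch_size=32,
    variable_noise=0.5,  # std of Gaussian noise added to trainable variables
)

states = environment.reset()
terminal = False
while not terminal:
    actions = agent.act(states=states)
    states, terminal, reward = environment.execute(actions=actions)
    agent.observe(terminal=terminal, reward=reward)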

Table 32 DQN algorithm hyperparameters: differences between Stable-baselines3 and Tensorforce
Fig. 24 Number of bugs detected by DQN Stable-baselines and DQN (with Gaussian noise) Tensorforce

Fig. 25 Average cumulative reward earned by DQN Stable-baselines and DQN (with Gaussian noise) Tensorforce


The performance of the A2C and PPO algorithms from the selected DRL frameworks indicates that their faster convergence leads to quicker bug detection as well as wider state coverage.


The performance of the DRL algorithms when applied to the datasets of the test case prioritization problem indicates that the DRL agents are not able to learn an accurate policy on the simple datasets.

Table 33 Results of State coverage and the number of bugs detected, performed by DQN-TF configuration for the game testing problem on a 10k steps budget over 5 runs
Table 34 Results of State coverage and the number of bugs detected, performed by A2C-TF configuration for the game testing problem on a 10k steps budget over 5 runs
Table 35 Results of State coverage and number of bugs detected, performed by PPO-TF configuration for the game testing problem on a 10k steps budget over 5 runs

6 Related Work

Incorporating DRL algorithms in software engineering tasks has long been an active area of research (Singh and Sharma 2013; Bahrpeyma et al. 2015; Chen et al. 2020; Vuong and Takada 2018).

In the case of SE testing, wuji, by Zheng et al. (2019), is a framework that applies Evolutionary Algorithms (EA), Multi-Objective Optimization (MOO), and DRL to facilitate automatic game testing. EA and MOO are designed to explore states, while DRL ensures the completion of the mission of the game. Furthermore, the authors use the Block Maze game and two commercial online games to evaluate wuji. This work is used as a baseline in this paper: we compare the DRL part of wuji to state-of-the-art DRL algorithms from DRL frameworks. Specifically, we implement DRL algorithms from DRL frameworks to detect bugs in the Block Maze game and assess their performance against the DRL part of wuji.

Bagherzadeh et al. (2021) leveraged state-of-the-art DRL algorithms from the Stable-baselines framework for CI regression testing. They investigate pointwise, pairwise, and listwise ranking models as DRL problems to find the optimal prioritization of test cases. The authors conducted experiments on eight datasets and compared their solutions against a small subset of non-standard DRL implementations. Again, we use this work as a baseline and implement DRL algorithms from DRL frameworks to rank test cases in a CI environment. As the authors used the Stable-baselines framework, we leverage two other DRL frameworks (Tensorforce and Keras-rl) and compare them to Stable-baselines.

Koroglu et al. (2018) proposed QBE, a Q-learning framework to automatically test mobile apps. QBE generates behavior models and uses them to train the transition prioritization matrix with two optimization goals: activity coverage and the number of crashes. Its goal is to improve the code coverage and the number of detected crashes for Android apps. Böttinger et al. (2018) introduced a program fuzzer that uses DRL to learn rewarding seed mutations for testing software. This technique obtains new inputs that can drive a program execution towards a predefined goal, e.g., maximizing code coverage. Kim et al. (2018) leveraged DRL to automatically generate test data from structural coverage. In particular, a Double DQN agent is trained in a Search-Based Software Testing (SBST) environment to find a qualifying solution following the feedback from the fitness function. Chen et al. (2020) proposed RecBi, the first compiler bug isolation approach via structural mutation that uses DRL. RecBi uses the A2C algorithm to mutate a given failing test program and then uses that test program to identify compiler bugs.

Adamo et al. (2018), Reichstaller and Knapp (2018), and Dai et al. (2019) used DRL to generate test cases. Adamo et al. (2018) built a DQN-based testing tool that generates test cases for Android applications; the tool is guided by code coverage to generate suitable test suites. Reichstaller and Knapp (2018) proposed a framework to test a Self-Adaptive System (SAS) in which the tester is modeled as a Markov Decision Process (MDP). The MDP is then solved using both model-free and model-based DRL algorithms to generate test cases that adapt to the SAS, as they are able to take decisions at runtime. Soualhia et al. (2020) leveraged DRL algorithms to propose a dynamic and failure-aware framework that adjusts Hadoop’s scheduling decisions based on events occurring in a cloud environment. Each of these previous approaches either implements a DRL algorithm from scratch or uses an implementation from a DRL framework. None of them has evaluated the performance of DRL frameworks on software testing tasks. Moreover, it is not clear what motivates the choice of DRL frameworks in the literature, as there are several of them. In our work, we investigate various state-of-the-art DRL algorithms from popular DRL frameworks to assess DRL configurations on game testing and regression testing environments.

7 Threats to Validity

Conclusion validity

Conclusion limitations concern the degree to which the statistical conclusions about which algorithms/frameworks perform best are accurate. We use Welch’s ANOVA and the Games-Howell post-hoc test as statistical tests. The significance level is set to 0.05, which is standard across the literature, as shown by Welch (1947) and Games and Howell (1976). The non-deterministic nature of DRL algorithms can threaten the conclusions made in this work. We address this by collecting results from 10 independent runs for the game testing problem. For the test case prioritization problem, the results are collected from 5 independent runs and over multiple cycles (the MATH dataset has 55 cycles, which is the smallest number of cycles among all datasets).

Internal validity

Regarding the game testing problem, the fact that we only consider the DRL part of the wuji framework for comparison with the DRL strategies we studied might threaten the validity of this work. However, restricting the comparison to the DRL part, even though it yields sub-optimal solutions, is necessary to make a fair comparison among the DRL algorithms. A potential limitation is the number of frameworks used and the algorithms chosen among these frameworks: we have chosen to evaluate some of the available frameworks and have not evaluated all the algorithms they offer. However, the frameworks used are among the most popular on GitHub, as are the algorithms (see Sect. 2.1), which ensures good coverage in terms of the usage of DRL in SE. In the future, we plan to expand our study to cover more algorithms.

Construct validity

A potential threat to validity is related to our evaluation metrics, which are standard across the literature. We use these metrics to make a fair comparison among frameworks/algorithms under identical circumstances. We discussed some of their limitations and how they can be interpreted in Sects. 3.2 and 3.3.

External validity

Since our goal is to compare DRL frameworks and their implemented algorithms on SE testing tasks, a potential limitation is the choice of the testing tasks used for the comparison. We address this threat by choosing the game testing and test case prioritization problems, which are quite different SE testing tasks, to achieve enough diversity: while test case prioritization focuses on optimizing the order of test cases, game testing requires finding bugs as early as possible. The results we obtained on the two studied problems mitigate this threat, as we consistently found some algorithms performing similarly across the frameworks. For example, the A2C algorithm performed well whether applied to the game testing problem or to the test case prioritization problem.

Reliability validity

To allow other researchers to replicate or build on our research, we provide a detailed replication package (Replication package 2022) including the code and obtained results.

8 Conclusion and Discussions

In this paper, we study the application of state-of-the-art DRL algorithms implemented in well-known frameworks on two important software testing tasks: test case prioritization and game testing. We rely on two baseline studies to apply and evaluate the performance of DRL algorithms from several frameworks (i) in terms of detecting bugs in a game, and (ii) in the context of a CI environment to rank test cases. Our results show that the same algorithm from different DRL frameworks can have different performance. Each framework provides hyperparameters unique to its implementation; therefore, depending on the underlying SE task, the framework that has the most suitable hyperparameters will lead to better performance. We formulate recommendations to help SE practitioners make an informed decision when leveraging DRL frameworks for the development of SE tasks. In the future, we plan to expand our study to investigate more DRL algorithms/frameworks and more SE activities.

Regarding the game testing problem, the DQN algorithm has poor bug detection performance across all the studied frameworks, implying a poor exploration capability of DQN. For the test case prioritization problem, Table 24 shows that, for the pairwise configuration and some enriched datasets, DQN’s performance is close to that of A2C, which shows that DQN has a good ranking capability. The DQN algorithm computes the Q-values of each state-action pair in order to predict the next action to take. It is therefore suitable for the pairwise ranking model, as its action space is discrete (0 or 1), half the size of the action space of the game testing problem (0, 1, 2 or 4). This makes us wonder whether the discrete nature of the action space could be a factor in the obtained results. In the future, we plan to investigate this in more detail.

Regardless of the studied frameworks, the PPO and A2C algorithms have shown good performance when applied to the game testing problem. PPO performed slightly better, detecting 1 to 2 more bugs than A2C in the Block Maze game, and the PPO from the Stable-baselines framework detected 1 to 2 more bugs than the PPO from Tensorforce. Nevertheless, the hyperparameter tuning reported in Sect. 5, i.e., changing the variable noise parameter of PPO-TF, increased its detection capability beyond that of PPO-SB. Therefore, in the future, we plan to investigate the PPO-SB and PPO-TF implementations in more detail.