1 Introduction

The emergence of numerous project management tools and approaches can be attributed to increasing project complexity and team-based initiatives. The use of bug tracking tools is an important aspect of open-source project management. Bug reports and their debugging procedures have become an unavoidable part of software development over the previous few decades (Kamkar 1998). Software developers work hard to ensure that a software entity is bug-free (Fagan 2002). In practice, however, every software system encounters a large number of defect reports. For collecting, organising, and monitoring incoming bug reports, large software companies use bug tracking systems (Breu et al. 2010). Bug tracking systems (BTS) are also termed issue tracking systems (ITS); hence, the two terms are used interchangeably in this paper. Past works have produced a multitude of research endeavours to ensure genuine treatment of bugs. Most of these works are concerned with bug summary generation (Gupta and Gupta 2021), meta-field prediction (such as severity, priority, etc.) (Kumari et al. 2020; Sharma et al. 2021), duplicate identification (Neysiani et al. 2020; Isotani et al. 2021), developer recommendation (Goyal and Sardana 2016; Ye et al. 2020), bug reopening (Tagra et al. 2021), fixing-time estimation (Lee et al. 2020; Kumari et al. 2020), bug localization (Li et al. 2021), etc. These methods collect numerous bug report parameters/meta-fields from bug repositories and use them to create prediction models for specific tasks.

Fig. 1: The roadmap for the article

In a basic scenario, various stakeholders of a BTS (such as end users, developers, and testers) file the problems they find to the BTS (Anvik 2006). The bug triager checks the bug's existence and, if it is found to be valid, assigns the bug to a developer (Zhang et al. 2014). The developer uses the information supplied by the reporter in the bug report to reproduce the problem (Shokripour et al. 2015). However, if the developer is unable to replicate the bug, it is designated as Non-Reproducible (NR) (Joorabchi et al. 2014). In bug repositories, NR bug reports are a significant performance issue since they consume a significant amount of developer time and effort. NR bugs delay bug fixing and may even lead to the release of a software project with critical bugs (Rahman et al. 2020). Hence, the detection of NR bugs early in the bug life cycle is an open research problem requiring investigation.

Joorabchi et al. (2014) published the first characterization study on NR bugs. They addressed four research questions related to the quantitative and qualitative analysis of NR bugs. They manually mined the cause categories and transition patterns of about 1,600 NR bugs. Further, they studied the NR bugs which eventually got fixed. After conducting an exploratory investigation on 6 bug tracking repositories, they discovered that 17% of all bug reports are resolved as NR. The cause categories for 1,643 NR bugs are defined as Interbug Dependencies (45%), Environmental Differences (24%), Insufficient Information (14%), Conflicting Expectations (12%), Non-deterministic Behaviour (3%) and Others (2%). Furthermore, only around 2% of all NR bug reports eventually receive explicit code fixes, while the remaining fixed ones are repaired implicitly. This work sheds some light on the factors that lead to bugs being marked NR; however, it does not provide any mitigation strategy or any mechanism to improve the bug fixing process. Further, Goyal and Sardana (2017) presented a sentiment-analysis-based study of developers who worked on NR issue fixes. They discovered that developer comments posted in NR bug reports are more negative than those in standard defects. Machine learning classifiers were then used to forecast fixable issues from NR-flagged bugs. Our work differs from this work in that we do not study developer sentiments, since bug reports are technical documents consisting of technical keywords that lack any kind of sentiment. Secondly, the prediction model proposed by Goyal and Sardana (2017) deals with the prediction of reopened bugs, whereas our work deals with the prediction of new bugs. Hence, the work presented in this paper attempts to fill the research gap present in the literature: "to provide a mitigation strategy to early predict the NR bugs".

To the best of our knowledge, there does not exist any work on early prediction of NR bugs. A novel framework, NRPredictor, is proposed in this paper to forecast the fixability of bug reports. For fixability prediction, the proposed model combines feature selection and ensemble learning methods. Ensemble-based approaches use the capabilities of several different base classifiers to improve classification accuracy (Alzubi 2015). In this method, the training data is first separated into several disjoint subsets, and each subset is then used to train a base classifier. Feature selection algorithms reduce the complexity of the system by retaining only the most informative features.

Fig. 2: An example of an NR bug report

The following are the current work’s key research contributions (RC):

  1. The early fixability problem in bug reports has been examined. In this RC, the problem of predicting the bug type (R or NR) when a new bug is filed to the BTS has been examined.

  2. A novel framework, NRPredictor, based on feature selection and ensemble machine learning algorithms, has been proposed. In this RC, a novel framework has been proposed which predicts whether a new bug report will get fixed or will be marked as NR.

  3. Thirteen machine learning classifiers (Bayes Net, Naive Bayes, Naive Bayes Multinomial Text, Naive Bayes Updateable, IBk, Zero-R, JRip, OneR, PART, Decision Table, J48, Rep Tree and Random Tree) along with three ensemble learning techniques (Bagging, Boosting and Stacking) and one feature selection technique (Classifier Attribute Evaluator) have been utilized in the proposed framework, NRPredictor. In this RC, traditional and advanced machine learning algorithms have been utilized for the prediction of a newly reported bug as Fixable or NR.

  4. The proposed framework, NRPredictor, has been tested on three large-scale, well-known, long-lived, open-source Bugzilla repository projects, namely NetBeans,Footnote 1 Eclipse,Footnote 2 and Mozilla Firefox.Footnote 3 In this RC, bug reports from three long-lived software projects have been collected and processed to be fed into the NRPredictor framework for prediction purposes.

  5. Four evaluation metrics (Precision, Recall, F1-Score, Area under the Receiver Operating Characteristic Curve) have been used for comparison. The experimental findings reveal that the proposed framework, NRPredictor, consistently outperforms traditional machine learning techniques. F1-Scores of up to 88.3%, 87.8% and 87.4% have been obtained for the Mozilla Firefox, Eclipse and NetBeans projects, respectively. In this RC, performance evaluation of the proposed framework is conducted using various performance evaluation metrics.

The paper is organised as per the roadmap defined in Fig. 1. Section 2 goes through the background information, which includes the NR bug report structure, the bug report life cycle, and the ensemble and feature selection approaches used in this paper. The relevant past work across three domains (reproducibility, prediction models and ensemble techniques) is discussed in Sect. 3. The architecture of the proposed NRPredictor framework is detailed in Sect. 4. The experimental details are presented in Sect. 5. The results and analysis of the experimental evaluation are presented in Sect. 6. The threats to validity are discussed in Sect. 7. Finally, Sect. 8 concludes the work, and Sect. 9 discusses future research prospects.

Fig. 3: Life-cycle of a typical bug report in the Bugzilla repository

2 Background

This section covers the necessary background information for this research, such as the fundamental layout of a bug report, the typical life-cycle of an issue, and the various ensemble learning and feature selection methodologies used.

2.1 Bug report structure

A bug report is a record that contains complete information concerning a problem. It contains a number of bug meta-fields as well as some textual material. The meta-fields include bug id, product, component, platform, hardware, version, operating system, severity, priority, milestone, status, resolution, reporter's name, time-stamp of report submission, assignee, and so on. The textual information includes a quick summary or tagline, a detailed description of the error, and comments provided by the reporter, developers, or testers. Figure 2 displays an example of an NR bug report from the Eclipse project (Bug id: 13747).Footnote 4

In Fig. 2, the unique serial number assigned to every problem is referred to as the "Bug ID". The term "Product" refers to the broad area from which the bug sprang. The term "Component" refers to the product's next level of categorisation; one or more components can be found in a single product. The term "Version" refers to the software product version in which the defect was discovered. The "Status" parameter indicates where the bug is in its life cycle. The name of the developer who has been assigned the task of fixing the fault is referred to as "Assigned-to". The term "Summary" refers to a one-sentence explanation of the reported defect. "Description" refers to the bug report's complete, comprehensive specification, which is usually written by the reporter. The description usually consists of 3 main elements: observed behaviour, steps to reproduce, and expected software behaviour (Chaparro et al. 2017). The term "Comments" refers to an open-ended discussion among developers to find viable remedies for solving the bug.

Along with the particular meta-fields and textual contents, a bug report may contain attachments, URLs, and automatically produced notes. Extra information about the problem is commonly included in these fields, such as test cases, patch files, user-supplied screenshots, the URL of the website containing the issue, similar duplicate bugs, and so on.

2.2 Bug life-cycle

A bug progresses through various phases during its existence. Figure 3 shows the life-cycle of a bug report in the Bugzilla repository.Footnote 5 For different projects, the life-cycle stages may vary slightly, but the mainstream order remains the same. Initially, any bug's existence is UNCONFIRMED: a bug reporter has reported the problem, but its existence has yet to be validated. The existence of an unconfirmed issue is confirmed by the bug triager, who then labels the validated bug as NEW. Because a bug submitted by an expert is presumed to be real and existent, it may reach the NEW state immediately. The bug triager assigns a verified bug to a developer and labels its status as ASSIGNED. The assigned developer investigates the problem, reproduces it, and performs appropriate modifications to fix it.

Numerous bug report resolutions are available in the RESOLVED status, including fixed, duplicate, won't fix, worksforme (NR), invalid, remind, and later. The resolution of the problem is marked as fixed once the assigned developer has successfully made the relevant source code adjustments. However, the assigned developer is not always able to discover a valid remedy for the reported issue. A software developer may discover that the claimed problem is not unique when investigating a bug report: it might be a duplicate of an existing or fixed problem, or it could have the same basic cause as another bug. In this case, the bug's resolution is marked as duplicate (Sureka and Jalote 2010). The resolution of a bug report that outlines an issue which will not be rectified is set as won't fix. The problem is marked as NR or worksforme if it cannot be recreated using the information given in the bug report. When additional information is added to the NR bug, it may be reopened, which may aid in replicating the problem. A bug's resolution is marked as invalid when it is proven to be illegible or spam; invalid bugs are not considered real problems (Yuan et al. 2021). Bugs that would force third-party software or websites to make changes, for example, constitute a breach of legal and contractual obligations. If a bug requires further information and cannot be addressed immediately, it is marked with the resolution remind or later (Abou et al. 2021).

Table 1 Review of past works on non-reproducible bugs

2.3 Ensemble learning/ classification

Classification refers to the division or categorisation of objects in a system that organises them. Initially, manual classification of items was popular. Manual classification, however, has the drawbacks of being exceedingly time-consuming and fundamentally subjective in nature (Bauer et al. 1999; Alzubi et al. 2018). As a result, automatic classification algorithms were developed. Automatic classification is more objective, quicker, and scalable. It can be effective in more complicated, nuanced circumstances, such as business-specific material, because it provides companies with a more systematic and consistent classification. Artificial intelligence techniques are used extensively in computing for training, forecasting and evaluation purposes (Movassagh et al. 2021). Automatic document categorization can benefit from machine learning and artificial intelligence techniques to improve speed and efficiency.

Ensemble classification approaches are a type of meta machine learning algorithm that has recently gained popularity. To improve predictive performance, these strategies aggregate predictions from different learning algorithms (Dietterich et al. 2000). Distinct machine learning classifiers have different fundamental principles and training data sensitivities; as a result, different classification systems make different predictions on the same data. These various outcomes are used by ensemble machine learning algorithms to produce a superior prediction output (Alzubi et al. 2020). These strategies aim to reduce prediction model bias and variance while also attempting to improve on the prediction accuracy obtainable with any one of the constituent learning algorithms. Three alternative ensemble classification approaches were investigated in this paper; a minimal code sketch illustrating them is given after the list below.

  1. Bagging: Bagging, also referred to as "Bootstrap Aggregating", is a meta-estimator that uses several random subsets of the original dataset to fit a base classifier. The original dataset is re-sampled with replacement, and the predictions of the several learners are combined to generate the final result. Breiman demonstrates that the bagging approach is helpful for unstable learners (Breiman 1996).

  2. Boosting: This approach combines various weak classifiers to produce a powerful classifier. A model is deemed weak if it has a large error rate (0.5 or more for binary classification). In each iteration, the ensemble classifier is constructed to reduce the mistakes made in the previous step. Iterations are repeated until the maximum number of iterations is reached or until the whole training dataset is correctly predicted (Freund and Schapire 1995).

  3. Stacking: Stacked generalisation is an ensemble strategy which uses a training dataset to train several base classifiers and then uses these base classifiers to build a new dataset. This new dataset is then combined using a meta (combiner) machine learning approach (Wolpert 1992).
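The following is a minimal, illustrative sketch of the three ensemble strategies using scikit-learn. The base and meta learners shown (default decision trees, Gaussian Naive Bayes, logistic regression) are placeholder choices for exposition and are not the exact configuration used in NRPredictor.

```python
# Minimal sketch of bagging, boosting and stacking with scikit-learn.
# The base/meta learners are illustrative placeholders, not NRPredictor's configuration.
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier, StackingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score

# Toy binary dataset standing in for the R/NR bug-report feature vectors
X, y = make_classification(n_samples=500, n_features=20, random_state=42)

# Bagging: bootstrap re-samples of the data, one base learner per sample,
# predictions combined by unweighted voting
bagging = BaggingClassifier(n_estimators=50, random_state=42)

# Boosting: each round re-weights the instances the previous learners got wrong
boosting = AdaBoostClassifier(n_estimators=50, random_state=42)

# Stacking: base learners' outputs form a new dataset for a meta-classifier
stacking = StackingClassifier(
    estimators=[("nb", GaussianNB()), ("tree", DecisionTreeClassifier())],
    final_estimator=LogisticRegression(),
)

for name, model in [("bagging", bagging), ("boosting", boosting), ("stacking", stacking)]:
    print(name, cross_val_score(model, X, y, cv=5, scoring="f1").mean())
```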

2.4 Feature selection techniques

Raw machine learning data is made up of a variety of attributes, some of which are useful for making predictions and others that are not. Feature selection approaches assist in identifying a set of relevant features from a large number of candidates. The Classifier Attribute Evaluator was used to select features in this paper. This attribute evaluator assesses the worth of each attribute (also known as a column or feature) in the dataset in relation to the output variable (i.e. the class).
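As a rough, illustrative analogue of classifier-based attribute evaluation (not the exact evaluator implementation used in this paper), each feature can be scored by how well a chosen classifier predicts the class from that feature alone, and only the highest-scoring features retained. The classifier, scoring metric and number of retained features below are placeholder choices.

```python
# Rough sketch of classifier-based attribute evaluation: score each feature by
# how well a classifier predicts the class from it alone, then keep the best ones.
import numpy as np
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=400, n_features=15, n_informative=5, random_state=0)

def attribute_scores(X, y, clf=None, cv=5):
    """Cross-validated F1 of a classifier trained on each single column."""
    clf = clf or GaussianNB()
    return np.array([
        cross_val_score(clf, X[:, [j]], y, cv=cv, scoring="f1").mean()
        for j in range(X.shape[1])
    ])

scores = attribute_scores(X, y)
top_k = 8                                   # number of features to retain (placeholder)
selected = np.argsort(scores)[::-1][:top_k]
X_reduced = X[:, selected]                  # reduced feature matrix fed to the learners
print("selected feature indices:", selected)
```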

3 Literature review

For the past two decades, the study of software flaws has been a hot topic of research. Perry and Stieg (1993) presented preliminary research on the investigation of reported problems in major software projects. The authors performed a survey to find out what kinds of difficulties users report, how they are discovered, and at what point of testing they are filed to the BTS. Since then, various studies have examined different buggy areas. This section reviews previous research in these areas, divided into three categories: reproducing bug reports, prediction models in bug fixing, and ensemble learning in bug fixing.

3.1 Reproducing bug reports

A bug report comprises 3 key elements: the procedure for replicating the problem, what the reporter expected to observe, and what the reporter actually observed (Chaparro et al. 2017). These 3 elements aid software developers in verifying, locating, and replicating the problematic scenario, as well as understanding the fundamental cause of the fault. After that, the assigned developer fixes the problem by making modifications to the source code. Reproducing a bug report is notoriously difficult since engineers are only given limited information about the failure, such as a memory dump. ReCrash, introduced by Artzi et al. (2008), is an automated approach to construct test cases that simulate a software failure. CRASHDROID was created by White et al. (2015) to automate the replication of problems for Android apps. Jin and Orso (2012) established BugRedux, an approach that gathers extra information from the field and transmits the collected information to developers for reproducing the failure circumstance. Despite the fact that studies exist to assist developers in recreating problem reports, their in-field performance is quite poor. RecrashJ, the Java version of ReCrash, for example, has a performance overhead of 13–64%.

If a developer is unable to replicate an issue, its resolution is marked as NR. It is often perplexing and time-consuming for engineers to manage NR problems. An empirical analysis of over 32,000 NR bugs was given by Joorabchi et al. (2014), who discovered that the resolution NR is assigned to 17% of all bug reports, and that just 3% of NR-assigned bug reports get repaired. Further, 1,643 NR issue reports were manually sorted into six different cause groups: inter-bug dependencies, environmental differences, insufficient information, conflicting expectations, non-deterministic bugs, and others. Goyal and Sardana (2017) conducted a sentiment study of developers who worked on NR issue fixes. They discovered that developer comments posted in NR bug reports are more negative than those in standard defects. Machine learning classifiers are also used to forecast fixable issues from NR-flagged bugs.

Table 1 presents the review of literature in the broad domain of NR bugs. From Table 1 it can be observed that these works shed some light on the factors that lead to bugs being marked NR; however, they do not provide any concrete mitigation strategies. Joorabchi et al. (2014) do not provide any mechanisms to improve the bug fixing process. Goyal and Sardana (2017) presented a sentiment-analysis-based model to forecast fixable issues from NR-flagged bugs. However, their work deals with the prediction of re-opened bugs, whereas the current manuscript deals with the prediction of new bugs. Hence, the work presented in this paper attempts to fill the research gap present in the literature: "to provide a mitigation strategy to early predict the NR bugs".

Table 2 Review of past works on prediction models in bug fixing

3.2 Prediction models in bug fixing

In the arena of software engineering research, debugging is a well-known concept. Many research and prediction models have been established for bug summarization (Koh et al. 2021), bug triaging (Mohan et al. 2016; Goyal and Sardana 2017), duplicate detection (Rocha and Carvalho 2021), fix time prediction (Yuan et al. 2021), blocking bug prediction (Cheng et al. 2020), reopened bugs (Shihab et al. 2013), etc. Garcia and Shihab (2014) developed a bug-blocking prediction model. They inspected the performance of decision tree, Naive Bayes, kNN, random forest, and Zero-R classifiers using 14 bug attributes to discriminate between blocking and non-blocking bug reports. Using 10-fold cross validation, they were able to reach an F-measure of \(15-42\%\). Xia et al. (2015) expanded on this research to address the problem of class imbalance in blocking bug prediction. It has been found that ensemble learners successfully handle the phenomenon of class imbalance and, as a result, may increase minority class prediction accuracy. Our forecasting is based on the same principles. Shihab et al. (2013) addressed reopened bug reports. For bug report classification, they employed 22 distinct characteristics divided into four categories: developer work habits, bug report, problem fix, and team. While predicting reopened bugs, they reported accuracy values of 52.1–78.6% and recall values of 0.5–94.1% for reopened bug finding. Comment text and last status were discovered to be the most important elements.

Fig. 4: Architecture of the proposed framework, NRPredictor

Hewett and Kijsanayothin (2009) built a model that anticipates how long it will take to fix software bugs. On a medical software system dataset, the suggested model has an accuracy of 93.44%. Guo et al. (2010) presented an architecture for predicting the fixability of a freshly discovered problem. On the Microsoft Windows Vista project, the suggested model achieved accuracy values of up to 68% and recall values of up to 64%. Zimmermann et al. (2012) analysed and evaluated reopened bug reports to determine likely causes of reopening and to assess their effect. Table 2 presents the review of literature related to prediction models in different phases of bug handling. Meta-heuristic algorithms such as AHP, TOPSIS, etc. are also used in bug handling processes nowadays (Goyal and Sardana 2017). Research is in progress to further optimise meta-heuristic algorithms (Agushaka et al. 2022; Abualigah et al. 2021, 2022; Oyelade et al. 2022; Sethuraman et al. 2019).

Various studies have shown that machine learning classifiers are successful in predicting different buggy areas (Garcia and Shihab 2014; Ahmed et al. 2021; Malhotra et al. 2021; Rashmi and Kambli 2020). In contrast to the many prediction models in past works relating to software debugging procedures, our research focuses on NR defects. The experiments were carried out on bug reports labelled with the classes Reproducible (R) and NR. The goal is to forecast which bug reports can be fixed.

3.3 Ensemble learning in bug fixing

In the machine learning classification literature, ensemble learning plays an essential role. Numerous ensemble-based classifiers have been suggested to increase the performance of traditional machine learning classifiers (López et al. 2013). The effectiveness of ensembling techniques may be attributed to the diversity of their base learners (Guo et al. 2008). As a result, ensemble classifiers use a group of base classifiers to build a prediction model. There are many different forms of ensemble models, e.g., bagging (Breiman 1996), boosting (Freund and Schapire 1995), stacking (Wolpert 1992), etc. The bagging approach uses the same base classifier to train several classifiers, which are then combined using an unweighted majority voting mechanism; the majority of votes determines the final forecast. Bagging approaches typically outperform single model algorithms by a wide margin, and bagging is rarely considerably worse, since it mitigates the classifiers' volatility (Phua et al. 2004). Boosting is an iterative method which assigns a weight to each instance of the training dataset in every iteration. During the first run, all weights are set equal (Freund and Schapire 1995). The weights of improperly categorised instances are raised with each repetition. This helps weak learners concentrate on the training set's difficult cases. Stacking combines numerous classifiers using a meta-classifier such as Logistic Regression (Wolpert 1992). Multiple base classifiers are used to classify a single test case, and the output of these base classifiers is fed into a meta-classifier, which produces the ultimate prediction.

In the software debugging literature, ensemble learners have been employed in a number of studies. A stacking ensemble approach for automated bug triaging was assessed by Jonsson et al. (2016). According to their findings, stacking beats standard machine learning techniques for the multi-class problem of developer selection. Goyal and Sardana (2019) provided an empirical study of bug triaging strategies using ensemble classification algorithms (bagging, boosting, majority voting, average voting, and stacking). According to the researchers, ensemble classifiers outperformed standard machine learning algorithms in the identification of an appropriate developer to handle a bug report.

Table 3 Feature/ parameter list used in NRPredictor framework

Limsettho et al. (2018) presented a SMOTE-based technique to predict cross-project defects. Laradji et al. (2015) found that ensemble learning has a favourable influence on software fault prediction. Xia et al. (2015) proposed ELBlocker, an ensemble learning approach for predicting blocking problems. They demonstrated that orthodox machine learning methods perform poorly on severely unbalanced datasets, but ensemble-based strategies aid in the development of stronger prediction models. They found that employing an ensemble-based method improved F1-Score by 14.69% when compared to traditional machine learning classifiers. Lal et al. (2017) introduced an ensemble-based model for logging predictions called ECLogger. To deal with the problem of class imbalance, they employed bagging and voting ensemble approaches. Using ensemble-based approaches, they were able to obtain better prediction performance. Motivated by these studies (Jonsson et al. 2016; Lal et al. 2017; Xia et al. 2015), the issue of fixability prediction has been addressed. Ensemble-based techniques along with feature selection techniques have been used to predict fixable and NR bugs optimally.

4 NRPredictor framework

This section describes the architecture of the proposed framework, NRPredictor, as shown in Fig. 4. The framework is divided into two phases: model building and prediction.

PHASE 1: model building phase

Initially, previous bug reports from the bug repository with known labels (R or NR) are used as input. Following that, different characteristics are retrieved and pre-processing techniques are applied. Machine learning is then used to learn multiple models based on the retrieved information. Finally, this phase generates hybrid models for forecasting the class of bug reports that have not been labelled. The following are the steps in the model building phase of the proposed framework, NRPredictor:

Fig. 5: General procedure of the proposed framework, NRPredictor

  1. Dataset acquisition: First, bug reports with class labels (R and NR) are collected from various projects of the Bugzilla repository in this stage. A bug report is considered NR if it is marked as NR or "worksforme" at any point during its life cycle.

  2. Feature extraction: Next, 9 different features are extracted from the collected bug reports. Among the 9 features, 8 are numerical (component, severity, priority, operating system, hardware, version, number of comments and cc count) and 1 is textual in nature. All of the features utilised in this paper, along with their descriptions, are listed in Table 3.

  3. Data pre-processing: The bug reports' textual contents are processed in this stage to build a feature vector containing critical keywords. This stage entails cleaning up the text content. The bug report's textual description is received first. The text is then tokenized, with all concatenated terms being broken up and converted to lower case. Stop words are eliminated because, although their frequency is high, they do not carry any information. The stop word list provided by the Python Natural Language ToolKit (NLTK) has been used. Porter stemming (supplied by Python NLTK) is then used to transform the remaining tokens to their root form. Stemming is the process of conflating closely related terms (for example, stemming converts the two words likes and likely to the root term like). A code sketch of this pre-processing pipeline is given after this list.

  4. Feature vector creation: Next, the frequency of each token is determined after data pre-processing, and a textual feature vector is constructed. In this stage, the top 100 tokens with the highest overall frequency derived from the textual content are considered as textual features.

  5. Feature selection/ reduction: Feature selection techniques help to obtain a set of relevant features from the list of all available features. In this work, the Classifier Attribute Evaluator has been used for feature selection. The attribute evaluator assesses each attribute in the dataset (also known as a column or feature) in relation to the output variable (i.e. the class).

  6. Classifier learning: Various NRPredictor models were created in this stage. First, 13 machine learning algorithms (Bayes Net, Naive Bayes, Naive Bayes Multinomial Text, Naive Bayes Updateable, IBk, Zero-R, JRip, OneR, PART, Decision Table, J48, Rep Tree and Random Tree) from four families (Bayes, trees, rules and lazy) were used to learn the models. An empirical investigation was conducted by comparing the performance of these 13 machine learning classifiers. Then, three ensemble learning strategies were applied on top of the 13 machine learning classifiers. Three ensemble-based strategies (bagging, boosting, and stacking) are utilised to create prediction models for fixability prediction.
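The following is a minimal sketch of the text pre-processing and feature-vector steps described above, assuming NLTK for tokenization, stop-word removal and Porter stemming. The helper names and the toy description strings are illustrative only and are not taken from the NRPredictor source.

```python
# Minimal sketch of the textual pre-processing and top-100 token feature vector.
# Helper names and sample text are illustrative; run nltk.download('punkt') and
# nltk.download('stopwords') once before use.
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

STOP_WORDS = set(stopwords.words("english"))
STEMMER = PorterStemmer()

def preprocess(description: str) -> list:
    """Tokenize, lower-case, drop punctuation/stop words, and stem."""
    tokens = word_tokenize(description.lower())
    tokens = [t for t in tokens if t.isalpha() and t not in STOP_WORDS]
    return [STEMMER.stem(t) for t in tokens]

def build_vocabulary(descriptions, top_k=100):
    """Keep the top_k most frequent stemmed tokens over all training reports."""
    counts = Counter(tok for d in descriptions for tok in preprocess(d))
    return [tok for tok, _ in counts.most_common(top_k)]

def to_feature_vector(description, vocabulary):
    """Frequency of each vocabulary token in a single bug report description."""
    counts = Counter(preprocess(description))
    return [counts[tok] for tok in vocabulary]

# Toy usage with two fabricated bug descriptions
reports = ["Editor crashes when opening large file", "Crash on opening project files"]
vocab = build_vocabulary(reports, top_k=100)
print(to_feature_vector(reports[0], vocab))
```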

PHASE 2: prediction phase

This phase accepts as input a bug report whose class label has to be predicted. It then extracts the bug report's characteristics, applies pre-processing techniques, and predicts the label using the hybrid models created during the model building phase (Phase 1). The following step is involved in the prediction phase of the NRPredictor framework:

Classification: In this step, the 9 features of the test bug report are first extracted. Then, pre-processing is done by applying tokenization, case conversion, punctuation removal, stop word removal and stemming to the test bug report. The pre-processed feature vector is then supplied to the proposed framework, NRPredictor. On the basis of the various learned hybrid models, the test bug report is classified and a label (either R or NR) is predicted for it. Next, the label predicted by the proposed framework and the ground truth label are compared, and various evaluation metrics are computed to evaluate the prediction performance of the learned models and the NRPredictor framework.

5 Experimental details

The subject systems, implementation details, assessment measures, and research questions addressed in this paper are detailed in this section. The experimental setup comprises a MacBook Pro with 8 GB of memory and a 2.7 GHz Intel Core i5 processor running Mac OS X 10.13.1. However, a few machine learning algorithm combinations could not be run on this platform within a time threshold of 8 hours; such combinations were considered to require higher computing power. Hence, all such algorithm combinations were run on a GPU setup equipped with NVIDIA Tesla V100 cards with 16 GB RAM (5120 CUDA cores each), four of which were used in parallel for this work. For experimentation, the Python programming language has been used with Jupyter Notebook as the Integrated Development Environment (IDE). The scikit-learn machine learning library has been used for building conventional and ensemble classification models.

5.1 Subject systems

This research examined bug reports obtained from three projects hosted on the Bugzilla BTS: NetBeans, Eclipse, and Mozilla Firefox. NetBeans is an open-source development environment, tooling platform, and application framework for creating programs using modules.Footnote 6 It is a Java application. Eclipse is a framework that contains a number of tools for creating a Java integrated development environment (IDE).Footnote 7 It was introduced in 2001 and is the most extensively used Java IDE. It is primarily written in Java. Mozilla Firefox is an open-source web browser launched in 2002.Footnote 8 This popular web browser is available for a variety of systems and is written in C++.

Table 4 Distribution of bug reports in R and NR categories

Since a huge number of bug reports are submitted on a daily basis, the different projects of the Bugzilla dataset are suitable for assessing trends and evaluating the recommended techniques. While choosing experimental projects, an attempt was made to cover a wide variety of disciplines. These projects have been used frequently in past studies (Anvik and Murphy 2011; Bhattacharya and Neamtiu 2010; Tamrawi et al. 2011). The popularity and maturity of these projects make them ideal for researching any novel bug-handling technique. A total of 261,551 bug reports were gathered and divided into two groups (R and NR). The distribution of bug reports into R and NR categories is depicted in Table 4.

5.2 Implementation details

The proposed framework, NRPredictor, has been evaluated using bug reports obtained from three long-lived projects of the Bugzilla BTS: NetBeans, Eclipse, and Mozilla Firefox. The general implementation procedure is depicted in Fig. 5. First, the bug reports with known labels (R or NR) are collected from the bug repository and are considered as ground truth. Next, 8 numerical features (component, severity, priority, operating system, hardware, version, number of comments and cc count) and 1 textual feature are extracted from the bug reports. The textual feature is then pre-processed using five procedures: tokenization, case conversion, punctuation removal, stop word removal and stemming. Next, the frequency of each token is determined and the 100 most frequent tokens are considered for feature vector creation. Further, a feature selection technique, the Classifier Attribute Evaluator, is used to further reduce the complexity of the dataset. Then, various NRPredictor models are created using thirteen machine learning algorithms (Bayes Net, Naive Bayes, Naive Bayes Multinomial Text, Naive Bayes Updateable, IBk, Zero-R, JRip, OneR, PART, Decision Table, J48, Rep Tree and Random Tree) and three ensemble learning strategies (bagging, boosting, and stacking) for each considered project. Finally, the built models are evaluated by passing new bug reports as input. The proposed framework, NRPredictor, outputs a class label (R or NR) for each testing bug report. A rough sketch of such an end-to-end pipeline is given below.
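As a rough illustration of how the pieces described above fit together, the following sketch wires feature selection and a stacking ensemble into one scikit-learn pipeline. The specific estimators, the SelectKBest selector standing in for the Classifier Attribute Evaluator, and the toy data are assumptions made for exposition, not the exact NRPredictor implementation.

```python
# Illustrative end-to-end pipeline: feature selection followed by a stacking ensemble.
# SelectKBest stands in for the Classifier Attribute Evaluator; estimators are placeholders.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import StackingClassifier, RandomForestClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Toy data: 8 numerical meta-fields + 100 textual token frequencies per bug report
rng = np.random.default_rng(0)
X = rng.random((300, 108))
y = rng.integers(0, 2, 300)          # 0 = R (fixable), 1 = NR

pipeline = Pipeline([
    ("select", SelectKBest(score_func=f_classif, k=50)),   # feature selection step
    ("stack", StackingClassifier(
        estimators=[
            ("nb", GaussianNB()),
            ("tree", DecisionTreeClassifier()),
            ("rf", RandomForestClassifier(n_estimators=100)),
        ],
        final_estimator=LogisticRegression(),               # meta-classifier
    )),
])

# 10-fold cross-validated F1-Score, mirroring the paper's evaluation protocol
print(cross_val_score(pipeline, X, y, cv=10, scoring="f1").mean())
```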

5.3 Evaluation metrics

Four common performance assessment criteria were utilised to estimate the predictive power of the NRPredictor framework: Precision, Recall, F1-Score, and Area under the Receiver Operating Characteristic (ROC) curve. All of these assessment criteria have been extensively utilised in the software debugging area (Hewett and Kijsanayothin 2009; Shihab et al. 2013; Xia et al. 2015). When utilising NRPredictor models to predict the class label, there are four possible outcomes: (1) a bug is predicted as NR when it is truly NR (True Positive: tp), (2) predicted as NR but truly R (False Positive: fp), (3) predicted as R when it is truly NR (False Negative: fn), or (4) predicted as R when it is truly R (True Negative: tn). The metrics precision, recall, F1-Score, and ROC are computed using these four values:

Table 5 Performance of conventional classifiers on NetBeans project
Table 6 Performance of conventional classifiers on Eclipse project
Table 7 Performance of conventional classifiers on Mozilla Firefox project
  1. Precision: It denotes the proportion of relevant occurrences found among the overall number of examples retrieved. Equation 1 shows the formula for precision.

     $$ {\text{Precision}} = \frac{tp}{tp + fp} $$
     (1)

  2. Recall: It denotes the percentage of relevant occurrences found out of all relevant examples. Equation 2 shows the formula for recall.

     $$ {\text{Recall}} = \frac{tp}{tp + fn} $$
     (2)

  3. F1-Score: There is a trade-off between the precision and recall measurements; a rise in one frequently results in a drop in the other. As a result, evaluating prediction performance using precision and recall alone is problematic. The F1-Score combines the advantages of the precision and recall metrics: it is the weighted harmonic mean of precision and recall. It is a frequently used metric for evaluating performance (Lal et al. 2017; Xia et al. 2015). Equation 3 shows the formula for F1-Score; a small worked example is given after this list.

     $$ {\text{F1-Score}} = \frac{(\beta^{2} + 1)*{\text{Precision}}*{\text{Recall}}}{\beta^{2}*{\text{Precision}} + {\text{Recall}}} $$
     (3)

     When \(\beta \) is equal to 1, the F1-Score is calculated as shown in Equation 4.

     $$ {\text{F1-Score}} = \frac{2*{\text{Precision}}*{\text{Recall}}}{{\text{Precision}} + {\text{Recall}}} $$
     (4)

  4. Area under ROC curve: The ROC (Receiver Operating Characteristic) curve is a graph of the true positive rate (tpr) versus the false positive rate (fpr). The area under the ROC curve measures the probability that an NR bug report is assigned a greater likelihood than an R bug report. The ROC value lies between 0 and 1; a larger value indicates better prediction performance of the developed NRPredictor model.
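For concreteness, a small worked example with assumed counts (tp = 80, fp = 20, fn = 10, values chosen purely for illustration) shows how the first three metrics are computed:

$$ {\text{Precision}} = \frac{80}{80 + 20} = 0.80, \quad {\text{Recall}} = \frac{80}{80 + 10} \approx 0.889, \quad {\text{F1-Score}} = \frac{2 * 0.80 * 0.889}{0.80 + 0.889} \approx 0.842 $$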

Table 8 Performance of bagging ensemble learning technique using conventional classifiers on NetBeans project
Table 9 Performance of bagging ensemble learning technique using conventional classifiers on Eclipse project
Table 10 Performance of bagging ensemble learning technique using conventional classifiers on Mozilla Firefox project

The effectiveness of the NRPredictor models was assessed using the cross validation approach. When only a small number of data examples are present, cross validation is employed to provide an unbiased estimate of the model's performance. In k-fold cross-validation, the data is separated into k equal-sized subsets. The model is then generated k times, each time utilising \((k-1)\) subsets of data examples for training the learning classifier and the remaining subset for testing predictions.
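A minimal sketch of this evaluation protocol is given below, assuming scikit-learn's built-in scorers for the four metrics; the classifier used is a placeholder, and any of the learned NRPredictor models could be substituted.

```python
# Minimal sketch of k-fold cross validation with the four evaluation metrics.
# The classifier is a placeholder for any learned NRPredictor model.
import numpy as np
from sklearn.model_selection import cross_validate, StratifiedKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=600, n_features=30, random_state=1)  # toy R/NR data

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=1)
scores = cross_validate(
    RandomForestClassifier(n_estimators=100, random_state=1),
    X, y, cv=cv,
    scoring=["precision", "recall", "f1", "roc_auc"],
)

for metric in ["precision", "recall", "f1", "roc_auc"]:
    print(metric, round(np.mean(scores["test_" + metric]), 3))
```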

5.4 Research questions

The following set of research questions (RQs) are examined in this paper:

  • Research Question 1: What is the performance of traditional machine learning approaches for predicting bug report reproducibility?

  • Research Question 2: What is the performance of ensemble machine learning approaches for predicting bug report reproducibility?

  • Research Question 3: What is the performance of ensemble machine learning techniques after applying feature selection for predicting bug report reproducibility?

6 Results and analysis

This section discusses the results obtained for the three research questions addressed in this work.

RQ1: Performance of traditional machine learning approaches

Tables 5, 6 and 7 examine the performance of the thirteen base classifiers (Bayes Net, Naive Bayes, Naive Bayes Multinomial Text, Naive Bayes Updateable, IBk, Decision Table, Zero-R, JRip, OneR, PART, J48, Rep Tree and Random Tree) from four families (Bayes, lazy, rules and trees) to find the best fixability prediction classifier. Various evaluation metrics, namely Precision, Recall, F1-Score and ROC, have been computed for comparison purposes. In terms of F1-Score, the experimental findings show that the PART classifier surpasses all other classifiers. The PART classifier scored the greatest F1-Scores of 81.3%, 82.8%, and 85.6% on the NetBeans, Eclipse, and Mozilla Firefox projects, respectively.

Table 11 Performance of boosting ensemble learning technique using conventional classifiers on NetBeans project
Table 12 Performance of boosting ensemble learning technique using conventional classifiers on Eclipse project
Table 13 Performance of boosting ensemble learning technique using conventional classifiers on Mozilla Firefox project
Table 14 Performance of ensemble learning techniques using conventional classifiers on three considered projects

RQ2: performance of ensemble learning approaches

To address this RQ, ensemble learning models have been created using three techniques: Bagging, Boosting and Stacking. Tables 8, 9 and 10 present the results of the bagging ensemble models using the thirteen base classifiers on the three considered projects. The experimental results reveal that Bagging using the Random Forest algorithm performs better than the other classifiers when the F1-Score evaluation metric is considered for comparison. The highest F1-Scores of 83.1%, 84.4% and 86.6% were achieved by the Bagging ensemble learners on the NetBeans, Eclipse and Mozilla Firefox projects, respectively. Improvements of 1.8%, 1.6% and 1% were achieved by the Bagging models on the NetBeans, Eclipse and Mozilla Firefox projects, respectively, as compared to the best performing individual classifier (PART).

Table 15 Performance of ensemble learning techniques with feature selection using conventional classifiers on three considered projects

Tables 11, 12 and 13 present the results of the boosting ensemble models using the thirteen base classifiers on the three considered projects. The experimental results reveal that Boosting using the Random Forest algorithm performs better than the other classifiers when the F1-Score evaluation metric is considered for comparison. The highest F1-Scores of 84.8%, 86.2% and 87.2% were achieved by Boosting on the NetBeans, Eclipse and Mozilla Firefox projects, respectively. Improvements of 3.5%, 3.4% and 1.6% were achieved by the Boosting models on the NetBeans, Eclipse and Mozilla Firefox projects, respectively, as compared to the best performing individual classifier (PART).

For the stacking ensemble learning technique, all possible combinations of the thirteen base classifiers have been used to learn the models, starting from combinations of size three. In total, 8,100 different model combinations have been learned for each considered project. Logistic Regression has been used as the meta classifier. The performance and improvement of the best model combination for each project is reported in Table 14. The highest F1-Scores of 86.1%, 85.1% and 87.5% were achieved by the Stacking ensemble learning technique on the NetBeans, Eclipse and Mozilla Firefox projects, respectively. Improvements of 4.8%, 2.3% and 1.9% were achieved by the Stacking technique on the NetBeans, Eclipse and Mozilla Firefox projects, respectively, as compared to the best performing individual classifier (PART). For the NetBeans project, stacking with the base classifier combination of PART, Random Forest, REPTree, Naive Bayes and J48 obtained the best performance. For the Eclipse project, stacking with the base classifier combination of PART, REPTree, JRip, Naive Bayes and Decision Table obtained the best performance. For the Mozilla Firefox project, stacking with the base classifier combination of PART, REPTree, Naive Bayes and Decision Table obtained the best performance.

RQ3: performance of ensemble learning approaches with feature selection

To address this RQ, the Classifier Attribute Evaluator has been used for feature selection before applying the ensemble learning techniques for prediction. Table 15 summarizes the performance and improvement of the various ensemble learning algorithms combined with the feature selection technique. The highest F1-Scores of 83.1%, 84.4% and 86.6% were achieved by Bagging on the NetBeans, Eclipse and Mozilla Firefox projects, respectively. The highest F1-Scores of 84.8%, 86.2% and 87.2% were achieved by Boosting on the NetBeans, Eclipse and Mozilla Firefox projects, respectively. The highest F1-Scores of 87.4%, 87.8% and 88.3% were achieved by Stacking on the NetBeans, Eclipse and Mozilla Firefox projects, respectively. Improvements of 6.1%, 5% and 2.7% were achieved by Stacking on the NetBeans, Eclipse and Mozilla Firefox projects, respectively, as compared to the best performing individual classifier.

7 Threats to validity

Although the experiments were structured in such a way that there are few risks to validity, there are still a number of decisions that might impact the findings of this paper. The various threats to the validity of the reported work are examined in this section.

7.1 External validity

The generalizability of the generated outcomes is referred to as external validity. The bug reports included in this work's experimental assessments came from three open-source Bugzilla repository projects: NetBeans, Eclipse, and Mozilla Firefox. Bug reports from these projects may differ from those of other open-source and closed-source projects. As a result, the findings of the paper might not apply to other open-source and commercial software projects. Other open-source and closed-source projects, as well as those that employ other development approaches, will require further research. Although large open-source projects covering a wide range of topics have been investigated, there may be additional projects adopting diverse software practices. As a result, the conclusions may not generalise to every project.

7.2 Internal validity

The bias and mistakes in the experimental setting are referred to as internal validity. The data acquired from the bug repository was assumed to be ideal in this paper. However, there is a chance that the extracted data contains mistakes or noise, which might impact the paper's conclusions. To counteract this risk, the dataset included in this work has been drawn from widely used projects of the Bugzilla repository. These projects have a lengthy history and are continuously maintained; therefore, the retrieved data should be considered acceptable (if not optimal). The input parameters used to train the diverse machine learning models pose another threat to internal validity. In this paper, nine bug parameters were employed as input features for model training (eight numerical and one textual parameter). However, there may be an alternative collection of attributes that performs better for predicting NR fixability. A list of stop words offered by the Python NLTK toolkit was utilised to pre-process textual contents (http://www.nltk.org/). The Porter Stemmer tool from the Python NLTK toolkit was used to stem textual contents. This toolkit has been frequently utilised in the literature for comparable procedures. Other stop word lists and stemming tools, however, may have an impact on prediction accuracy. To reduce the risk of code and experimental setup problems, the source code and experimental setup have been double-checked; however, there is still a risk of mistakes. In the experiments, 10-fold cross validation was utilised to eliminate bias.

7.3 Construct validity

The experimental constructs or the adequacy of the assessment measures utilised in the study are referred to as construct validity. The F1-Score was reported in this paper. This measure has been extensively used in the literature to assess the performance of machine learning classifiers; hence, there is little concern about construct validity in this paper.

8 Conclusion

Bug management is an onerous task for software engineers due to the unpredictable nature of bug fixes. The complexity of this task is exacerbated by non-reproducible faults. To address non-reproducible bugs, a novel fixability prediction framework named NRPredictor is proposed in this paper. Thirteen traditional machine learning classifiers along with three ensemble learning approaches (Bagging, Boosting, and Stacking) and one feature selection technique have been leveraged in NRPredictor. The experimental evaluation shows that the traditional machine learning algorithm PART scored the greatest F1-Scores of 81.3%, 82.8% and 85.6% on the NetBeans, Eclipse, and Mozilla Firefox projects, respectively. The ensemble learning techniques outperform the traditional machine learning approaches, achieving F1-Scores of up to 86.1%, 85.1%, and 87.5% for the NetBeans, Eclipse, and Mozilla Firefox projects, respectively. Feature selection combined with ensemble learning techniques achieves F1-Scores of up to 87.4%, 87.8%, and 88.3% for the NetBeans, Eclipse, and Mozilla Firefox projects, respectively.

9 Future research directions

The performance of the proposed framework, NRPredictor, may be investigated in future on closed-source applications. Collaboration with firms that use open-source and closed-source bug repositories to analyse the proposed framework in an industrial context is also possible; this would aid in further generalisation of the findings. Text mining techniques such as topic modelling may also be integrated into the framework. In addition, a fix recommendation tool may be developed, which provides tokens for non-reproducible issues that might be solved. Another area of future study is the creation of a tool for software developers to aid in the prediction of NR bugs. Although the present study aids in the prediction of difficult-to-reproduce bugs that can be labelled as NR, NR issues continue to pose a significant barrier to the bug-fixing process; as a result, new ways for resolving NR-marked bug reports can be created.