1 Introduction

The use of software has grown exponentially over the years, driven by rising demand for automated and efficient systems such as electronic gadgets. Embedded software in digital and electronic systems has helped to minimize faults and make tasks easier and faster with little or no human intervention. However, the benefits of accurate and efficient software cannot be realized without software bug detectors.

The role of software bug predictors in software quality assurance [3, 4] cannot be over-emphasized [1, 2]. In recent years, researchers have applied several machine learning techniques to the prediction of software defects [5,6,7]. Common models have combined variants of Artificial Neural Networks (ANNs) with regression or classification models [8, 9]. Recent studies have also revealed that sub-optimal default parameter settings degrade the efficiency of these models over time [10,11,12,13], increasing the consumption of computing resources (memory and time) and reducing the accuracy of software bug predictors [14]. As prediction time and the overall cost of software tests increase [15, 16], the primary challenge for researchers and software developers is to improve software quality with limited testing resources through vectorization, a technique adopted by CPU manufacturers to improve the performance of modern computers [17, 18].

Vectorization is a technique for transforming an algorithm from operating on a single value at a time to operating on multiple values (vectors) per time unit [19]. It has been adopted to speed up program execution and to make programs more efficient and readable. In this paper, an optimized ensemble machine learning model (Logistic Regression and Extra Trees classifier) is implemented at the learning and testing phases to address the limitations of single machine learning techniques [20], generate better results, mitigate problems arising from imbalanced classes, and, through vectorization, achieve faster execution and better use of hardware resources.
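As a minimal illustration of the concept (not taken from the paper), the following Python sketch contrasts a scalar loop with a NumPy-vectorized equivalent; the function names and array size are illustrative only:

```python
import time
import numpy as np

def scale_scalar(values, factor):
    """Scalar form: one value is processed per loop iteration."""
    out = []
    for v in values:
        out.append(v * factor)
    return out

def scale_vectorized(values, factor):
    """Vectorized form: the whole array is processed per operation."""
    return np.asarray(values) * factor

data = np.random.rand(1_000_000)

start = time.perf_counter()
scale_scalar(data, 2.0)
print("scalar:    ", time.perf_counter() - start, "s")

start = time.perf_counter()
scale_vectorized(data, 2.0)
print("vectorized:", time.perf_counter() - start, "s")
```

The vectorized form is typically orders of magnitude faster, because the multiplication is dispatched to optimized, SIMD-capable routines instead of being executed one element at a time by the interpreter.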

The remainder of this paper is organized as follows. Severity-based classification and software bug debugging priorities are explained, followed by a review of recent research, in Sect. 2. Section 3 presents the methodology for developing the optimized ensemble machine learning model for software bug prediction. The implementation, test results, evaluation and analysis of the developed optimized ensemble model are documented in Sect. 4, and the discussion and conclusion are given in Sect. 5.

2 Related works

Large software project development requires modularization for easier and faster implementation, especially when two or more programmers are involved [15, 21]. After the various modules are compiled, integrating software bug predictors into them [11, 22, 23] becomes rigorous and time-consuming. Thus, including bug predictors at the earliest stage of software development is necessary. Software bug predictors (SBPs) aid in the identification and detection of bugs [24] by using collected software metrics as parameters [25, 26]. Metrics used include commented lines of code [27], modularized code metrics, web-related metrics for web-based programs [21], network metrics [28], cascading style metrics [29], and other mutation-based metrics identified for measuring software quality [30,31,32]. Software bugs appear with varied levels of severity and debugging priorities, as shown in Table 1. These severity levels are categorized as:

  a. Critical: At this stage, a system's overall performance is impacted by the software flaw. There is no workaround for it. It might also lead to the complete failure or unsuccessful installation of the software modules.

  b. Major: A software bug is categorized as having a major severity level if it has an impact on crucial data or major computer system functions. It happens when a module's malfunction prevents other modules from operating effectively, which lowers the system's performance.

  c. Minor: A few minor functions or non-critical data are affected. It has an easy workaround. A defective non-functional feature in a module could be improved to function in another module of the same software.

  d. Trivial: The functionality of the system or its data is not affected. The system's efficiency is not affected, but the bug may merely cause a form of inconvenience to the user.

Table 1 Software bugs debugging priorities

The impact of bugs on software is challenging to manage, and the cost of fixing them varies significantly with the number of bugs and the time of detection [3, 20, 26]. Detecting bugs at an early stage of development significantly lowers the cost of fixing them [27, 33].

In recent times, improving software defect prediction accuracy has involved combining two or more techniques, known as the ensemble or hybrid approach [6, 8, 34]. A systematic review of the forty-six most relevant papers published in four major digital libraries [35], including IEEE and Springer, from 2012 to 2021 identified WEKA as the most widely used machine learning platform and listed the most frequently used software bug evaluation metrics as F-measure, Area Under the Curve (AUC), Recall, and Precision. Rarely used ensemble approaches were also identified, including Extra Trees, voting, and stacking.

The authors of [36] implemented and tested a genetic algorithm-based optimization model for predicting software defects on six attributes of the NASA dataset. They formulated two hypotheses and four linear equations to justify their claim on the need for optimization techniques. The model was implemented in MATLAB, and evaluation was limited to accuracy and sensitivity, which yielded average values of 95–97%.

Yi et al. [37] proposed an automated machine learning model for the automatic reconstruction of buildings using photogrammetric and panchromatic images derived from satellite imagery. Their approach combined machine learning models to filter out non-building images, and building images were refined with edge vectorization and a conditional Corner Generative Adversarial Network (CGAN). The collected images were trained and tested on WorldView-1 data from Berlin, Germany. Standard metrics used for the quantitative evaluation of their model include the Normalized Median Absolute Deviation (NMAD), Mean Absolute Error (MAE) and Root-Mean-Square Error.

Considering the deployment of a hybrid approach to software defect prediction, Mohd and Mohd [6] applied principal component analysis (PCA) for feature selection and time-complexity reduction in a support vector machine model developed to predict and detect software defects on PROMISE (a collection of three software defect datasets from NASA). The model was implemented on the Spyder platform. The recorded accuracy values were claimed to be improvements over previous results.

Yuzhou [19] implemented a vectorized deep-learning approach for enhancing malware detection. Two networks, a Convolutional and a Recurrent Neural Network, were fused to achieve better prediction accuracy. Effective features were extracted through the semantic and instruction layers of the proposed architecture. The proposed GloVe-based model represented each opcode sequence as a K-dimensional real vector in matrix form. The results showed a class accuracy of 77.38% and feature detection of 83.95%; in comparison with other machine learning models, it generated the highest result of 88.89%, outperforming the others.

The collective work in [10] concentrated on the effects of employing biased performance metrics to forecast and measure defective software prediction models. The authors documented their findings on several software classification criteria in a thorough analysis of more than thirty peer-reviewed studies from 2012 to 2020. They found that the F1-measure is a popular metric for assessing defective software prediction experiments.

Estimating errors in computing systems due to software defects can help improve performance and software assurance levels [36, 38]. A combination of classification techniques (naïve Bayes, decision tree) and the Particle Swarm Optimization technique was used in developing a software fault prediction system [39]. The system was able to determine the interaction between software quality metrics and defective modules.

The development of a Neuro-Vectorizer for end-to-end reinforcement learning in software development [40] proved efficient in overcoming errors caused by compilers while generating loops and determining the number of instruction packs. A framework supporting multiple machine learning models was trained on a dataset containing loop and non-loop functions for about 500,000 steps to achieve a better reward in terms of compilation time, execution time, and generated assembly code size. The model's performance was benchmarked against other approaches, including random search, decision trees, and brute-force algorithms. However, its results averaged about 3% below the brute-force solution.

Similar research was conducted [17] to determine how efficiently microprocessors and compilers perform parallelism techniques on processes. Available vectorization policies, including Superword Level Parallelism (SLP) on an integer linear programming solver, were tested for performance comparison. A graph-based vectorization formulation was expressed as a Markov Decision Process (MDP) and trained on a Gated Graph Neural Network with 35,635 functions derived from benchmark suites (SPEC 2006, SPEC 2017, and NAS). The goSLP-vectorized samples, representing 11.2% of the data, were used for training, and the un-vectorized samples were further split in the ratio 4:1 for training and testing of the model, respectively. Their work showed that the auto-vectorization technique is better adopted by the compiler to improve parallelism within programs.

The proposition of Maria et al. [28] that the quality of software should not be judged by a single metric value but by an aggregate of software characteristics and jointly distributed metric values led to the development of a probabilistic (copula-based) approach, which was trained and tested on a software metric dataset. The model was evaluated and compared using correlation, agreement and distance ranking measurements. The results showed the model to be accurate and scalable.

Seven clustering approaches, including K-means, Expectation Maximization (EM), and Sequential Information Bottleneck (SIB), were evaluated [38] for detecting software defects on eight different NASA datasets. The results of the comparative analysis showed varying accuracy levels and prompted a caution on the choice of clustering technique to be adopted in developing software defect predictors.

As different classification techniques become more useful in machine learning for identifying buggy software [21], a similar technique was successfully utilized for multi-class classification of online products based on machine learning models (SVM and K-NN) [41]. Exploratory data analysis was performed to identify salient attributes and normalize them. The hashing vectorizer and the term frequency-inverse document frequency (Tf-idf) vectorizer were evaluated in three test phases (recall classification, quality and speed of text). The accuracy results from the evaluation of the two models were recorded as 79 and 2% for SVM and K-NN, respectively.

3 Research methodology

Several machine learning models have been developed in the literature. The technique adopted in this paper begins with transforming the collected textual National Aeronautics and Space Administration (NASA) program software dataset into meaningfully explored vectors, which are then trained and tested for predicting and detecting software bugs.

3.1 Data description and modeling

The collected National Aeronautics and Space Administration (NASA) program dataset for software bug prediction comprises fourteen (14) sub-datasets. Each subset contains software parametric attributes, as shown in Table 2, and is identified by name: CM1, JM1, KC1, KC2, KC3, MC1, MC2, MW1, MW2, PC1, PC2, PC3, PC4, and PC5. The data are formatted as zipped CSV files. From the available collection, PC5 was selected because it has the largest number of data items and is a complete dataset with no missing values. The PC5 dataset contains sixteen thousand nine hundred and sixty-two (16,962) instances with thirty-eight (38) attributes for prediction.

Table 2 Parametric software bugs attributes and numerical range
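A minimal loading sketch is shown below; the file name PC5.csv and the label column name Defective are assumptions, since the NASA MDP distributions name these differently across mirrors:

```python
import pandas as pd

# Assumed file name; the NASA MDP sets are distributed as zipped CSV files.
df = pd.read_csv("PC5.csv")

print(df.shape)               # expected: (16962, 39) -> 38 predictors + 1 label
print(df.isna().sum().sum())  # PC5 is reported complete, so this should be 0

X = df.drop(columns=["Defective"]).to_numpy()  # 38 parametric attributes per instance
y = df["Defective"].to_numpy()                 # class label: defective / non-defective
```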

3.2 Data dimension and class vector representation

A set of independent variables (v) associated with each instance in the PC5 dataset is collected to form a new set (S), as shown in Eq. (1). The two output classes specified for the proposed ensemble model are the defective and non-defective classes, represented as β1 and β2, respectively (Fig. 1).

$$ S = \left\{ \begin{array}{ccccc} V_{1,1} & V_{1,2} & V_{1,3} & \cdots & V_{1,38} \\ V_{2,1} & V_{2,2} & V_{2,3} & \cdots & V_{2,38} \\ V_{3,1} & V_{3,2} & V_{3,3} & \cdots & V_{3,38} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ V_{16962,1} & V_{16962,2} & V_{16962,3} & \cdots & V_{16962,38} \end{array} \right\} $$
(1)

where \(V_{ij}\) denotes the value of independent variable attribute \(j\) for instance \(i\).

Fig. 1
figure 1

Instances classification of the NASA dataset

3.3 Feature selection and dimension reduction

Principal component analysis (PCA) is used to extract, from the entire vectorized PC5 dataset, the features needed to generate a reduced dataset with a strong relationship to the output, thereby enhancing prediction.

Given the vectorized PC5 dataset with 38-dimensional vectorized variables (v), mathematically represented in Eq. (1), a lower-dimensional representation of S, called the reduced set R, is derived using Eq. (2). A total of five attributes derived from the vectorized dataset are used for the prediction of software bugs.

$$ R = \left\{ r_{1}, r_{2}, r_{3}, \ldots, r_{k} \right\}, \quad \text{with } k \le n $$
(2)

where R captures the data attributes (v) in the PC5 dataset according to the variance maximization criterion. The elements of R are the hidden components, k is the number of selected features and n is the number of instances. PCA thus maps the data from the space of the v variables to a new space of k uncorrelated variables over the dataset. With the adopted dimension reduction algorithm, the best five features were selected; the features and their corresponding rank values are tabulated in Table 3.

Table 3 PCA selected features for software bugs prediction
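A sketch of this reduction step with scikit-learn (one plausible realization; the paper does not name a library) could look as follows, reusing X from the loading sketch above:

```python
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize first so large-valued metrics do not dominate the variance criterion.
X_std = StandardScaler().fit_transform(X)

# Map the 38 correlated attributes to k = 5 uncorrelated principal components.
pca = PCA(n_components=5)
R = pca.fit_transform(X_std)          # reduced set R: shape (16962, 5)

print(pca.explained_variance_ratio_)  # variance captured by each component
```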

3.4 Ensembled software bug predictor flow diagram description

The research flow diagram in Fig. 2 depicts the implementation stages of the ensembled (Logistic regression and Extra tree) classifier model for software bug prediction.

Fig. 2
figure 2

Optimized Ensemble Software Bugs Prediction flow diagram

The initial process involves collecting the PC5 software metric dataset. The dataset is vectorized into an M × N matrix, where M denotes the initial thirty-eight (38) software variables and N represents the 16,962 instances. The vectorized dataset is subjected to PCA for attribute reduction, yielding five (5) significant software attributes. The pre-processed dataset is split in the ratio 3:2 for training and testing of the ensemble software bug predictor model. The ensemble classifier is developed and trained on the sixty percent of vectorized data until a well-optimized model is obtained.
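The 3:2 grouping corresponds to a 60/40 hold-out split; a minimal sketch follows (the stratification and random seed are our additions to preserve the class imbalance, not stated in the paper):

```python
from sklearn.model_selection import train_test_split

# 3:2 ratio -> 60% for training, 40% held out for testing.
X_train, X_test, y_train, y_test = train_test_split(
    R, y, test_size=0.4, stratify=y, random_state=42
)
```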

3.5 Model’s ensembled classifiers training and testing algorithms

The operation of the ensembled software bug prediction model depends on several algorithms implemented in succession for the effective operation and accurate prediction of the model. Each instance of the dataset is vectorized and converted into an m × n matrix. The PCA algorithm reduces the vectorized array to a 5-dimensional feature array. The ensemble algorithm combining the logistic regression and Extra Trees classifiers is then used for the training, testing, prediction and detection of software bugs.

The logistic regression classifier models the probabilities for the software bug classification problem with two possible outcomes. It is applied to determine the class (defective, non-defective) of each software module based on the available software prediction metrics. The logistic regression model is given as:

$$ \text{logistic}\left( \phi \right) = \frac{1}{1 + \exp\left( -\phi \right)} $$
(3)

where the logistic transform of \(\phi\) gives the probability of a successful (defective) prediction.

In addition, the extremely randomized trees (Extra Trees) classifier combines the results of multiple decision trees, collated as a forest, for the software defect classification problem. The software bug prediction is performed by combining the averaged probability from the logistic regression classifier with the majority vote of the decision trees. Each tree is trained with the best-selected attributes and split randomly based on predetermined functions.
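One plausible realization of this combination is soft voting, which averages the class probabilities of the two base learners; the fusion rule and hyperparameters below are assumptions, as the paper does not specify them:

```python
from sklearn.ensemble import ExtraTreesClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression

# Soft voting averages the predicted class probabilities of both learners,
# approximating the averaged LR probability plus the forest's majority vote.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("et", ExtraTreesClassifier(n_estimators=100, random_state=42)),
    ],
    voting="soft",
)

ensemble.fit(X_train, y_train)    # 60% training split from Sect. 3.4
y_pred = ensemble.predict(X_test)
```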

4 Implementation and results

All algorithms are implemented on an Intel Core™ i5-5020U CPU @ 2.20 GHz workstation with 8 GB RAM, a 1 TB hard disk drive and installed WEKA, MATLAB and PyCharm software. The implementation stages are:

  • Loading and fitting the dataset: The NASA dataset is loaded and its attributes fitted into equivalent rows and columns (Fig. 3).

  • Exploratory data analysis: Hidden information in the NASA data is analyzed using a correlation table. Figure 4 shows a scatter plot of significant and highly ranked attributes based on their representative value in the dataset.

  • Training, testing and model evaluation: Training is performed on sixty percent of the well-prepared standardized dataset using the ensemble model’s algorithm at repeated intervals. The forty percent hold-out data are then used to determine the model’s accuracy, precision, recall and F-measure scores. The mathematical equations for these evaluation metrics are given in Eqs. (4)–(7), respectively.

    $$ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100 $$
    (4)
    $$ \text{Precision}\;(P) = \frac{TP}{TP + FP} \times 100 $$
    (5)
    $$ \text{Recall}\;(R) = \frac{TP}{TP + FN} \times 100 $$
    (6)
    $$ F\text{-Measure}\;(F\text{-Meas.}) = 2 \times \frac{P \times R}{P + R} $$
    (7)

where True Positive (TP) denotes bugs correctly detected by the system, False Negative (FN) bugs not detected by the system, False Positive (FP) non-buggy code wrongly detected as buggy, and True Negative (TN) non-buggy code correctly not flagged as buggy (Fig. 5).
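Continuing the earlier sketch, Eqs. (4)–(7) can be computed directly from the confusion matrix; the label values "Y"/"N" follow the WEKA visualization in Fig. 3 and are an assumption about the encoding:

```python
from sklearn.metrics import confusion_matrix

# Fix the label order so that "Y" (defective) is the positive class.
tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=["N", "Y"]).ravel()

accuracy  = (tp + tn) / (tp + tn + fp + fn) * 100            # Eq. (4)
precision = tp / (tp + fp) * 100                             # Eq. (5)
recall    = tp / (tp + fn) * 100                             # Eq. (6)
f_measure = 2 * (precision * recall) / (precision + recall)  # Eq. (7)

print(accuracy, precision, recall, f_measure)
```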

Fig. 3
figure 3

Software bug attributes visualized in WEKA

Fig. 4
figure 4

Scatter diagram and attributes rank spread in the dataset

Fig. 5
figure 5

Training time plot of the vectorized dataset in both software

4.1 Software predictions, result analysis and evaluation

Additionally, classification metrics automatically generated by the two software platforms (WEKA and MATLAB) deployed for the model’s simulation include the Mean Absolute Error (MAE), Root-Mean-Square Error (RMSE), and the training, validation and testing durations, as shown in Figs. 5 and 6 and Tables 4 and 5.
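For reference, MAE and RMSE can be reproduced from the predicted class probabilities in the same sketch; treating the classes as 0/1 mirrors how such platforms commonly report these errors for classifiers (an assumption about the convention, not a statement from the paper):

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Probability assigned to the defective ("Y") class for each test instance.
idx = list(ensemble.classes_).index("Y")
proba = ensemble.predict_proba(X_test)[:, idx]

y_true01 = (y_test == "Y").astype(int)  # encode classes as 0/1

mae = mean_absolute_error(y_true01, proba)
rmse = np.sqrt(mean_squared_error(y_true01, proba))
print(mae, rmse)
```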

Fig. 6
figure 6

MAE and RMSE result plot from different machine learning models

Table 4 WEKA classification metric of machine learning models
Table 5 MATLAB classification metric of machine learning models

5 Discussion and conclusion

This study has effectively adopted a rarely used optimized ensemble learning approach (logistic regression and Extra Trees classifier) to classify and predict software bugs on the National Aeronautics and Space Administration (NASA) Metrics Data Program defect dataset, with implementation on two different machine learning platforms. The main aim of this approach is to compare the effectiveness of the developed ensemble model on both platforms, discover trade-offs and support a better choice of software platform for software bug prediction modeling.

In addition, the comparative analysis of the results recorded in Tables 4 and 5 showed that the MATLAB-generated values were higher than those obtained on the WEKA platform, with differences ranging from 0.01 to 3.7. The largest difference (3.7) occurred in the F-measure score of the tested logistic regression model. The records in Fig. 5 also show that the longest duration (459 s) was spent in the training phase of the ensemble model on the MATLAB platform. This implies that training the model with the software bug dataset was faster on the WEKA platform than on MATLAB.

The ensemble model was trained with scalar and vectorized datasets to compare memory space utilization at both the training and testing phases, as shown in Fig. 7. At the testing phase, the actual memory usage for prediction with the vectorized dataset is lower than with the un-vectorized data. A trade-off was noted between memory space utilization and time duration when using scalar versus vectorized data for software bug prediction.

Fig. 7
figure 7

Memory usage (bytes) for vectorized and un-vectorized datasets at the testing phase

Furthermore, Fig. 3 displays some software bug attributes as visualized on the WEKA platform. Two classes (Y = Yes and N = No) denote the output of the software bug prediction model. If the prediction outcome is "yes", the tested software is categorized as defective; otherwise, it is categorized as non-defective.

Although users may have doubts when choosing a platform and selecting the best model, the developed optimized ensemble model balances the limitations experienced by the individual models. The results generated by the optimized ensemble model in comparison with other related machine learning models, as shown in Figs. 6, 7 and 8, demonstrate the model’s efficiency and accuracy (Fig. 9).

Fig. 8
figure 8

Evaluation results compared to the optimized ensemble model

Fig. 9
figure 9

Time analysis of the optimized ensemble with scalar and vectorized datasets

In conclusion, the optimized ensemble model is simple to implement and offers high bug prediction accuracy, efficient memory utilization and minimal execution time.