1 Introduction

The growth and expansion of our information-based society has resulted in an increasing number of information products, and the functionality of these products is becoming ever more complex [6, 14]. Guaranteeing the quality of software is particularly important because it relates to reliability. It is therefore increasingly important for corporations that develop embedded software to implement efficient processes while guaranteeing timely delivery, high quality, and low development costs [2, 12, 15, 16, 18, 19, 20, 21]. Companies and divisions involved in the development of such software focus on a variety of improvements, particularly in their processes. For new software projects, estimating the number of errors and the amount of effort is essential: the number of errors is directly related to product quality, and the amount of effort is directly related to cost, both of which affect the reputation of the corporation.

Previously, we investigated the estimation of total errors and effort using an artificial neural network (ANN) and showed that ANN models are superior to regression analysis models for estimating errors and effort in new projects [8, 9]. We also proposed a method to estimate intervals for the amount of effort using a support vector machine (SVM) and an ANN [7, 10]. These models were constructed with data that excluded outliers. In practice, however, outliers can only be identified once a project has been completed; hence, they should not be excluded while constructing models and estimating effort. We therefore attempted to classify embedded software development projects by whether the amount of effort was an outlier, using an ANN and an SVM [11]. However, the accuracy of the classifications was not acceptable because of the small number of outliers.

This problem, referred to as data imbalance, occurs in most machine learning methods and exists in a broad range of experimental data [1, 22]. Data imbalance occurs when one of the classes in a dataset has very few samples compared to the other classes. When the number of instances of the majority class greatly exceeds that of the minority class, most samples are classified into the majority class; consequently, the few outliers are classified as normal values. To avoid this problem, we explored rebalancing methods for the error data using k-means [5] cluster-based undersampling. Evaluation experiments were performed to compare the classification accuracy of k-means undersampling with that of random undersampling and no undersampling using ten-fold cross-validation.

2 Related Work

2.1 Undersampling

Undersampling is one of the most common and straightforward strategies for handling imbalanced datasets. Samples of the majority class are dropped to obtain a balanced dataset. Simple undersampling randomly drops samples to generate a balanced dataset.
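As a minimal illustration (the function and variable names below are ours, not from any particular library), simple random undersampling can be sketched in Python as follows:

```python
import numpy as np

def random_undersample(X, y, majority_label=0, random_state=0):
    """Randomly drop majority-class samples until both classes have the same size."""
    rng = np.random.default_rng(random_state)
    majority_idx = np.where(y == majority_label)[0]
    minority_idx = np.where(y != majority_label)[0]
    # Keep only as many majority samples as there are minority samples.
    kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    keep = np.concatenate([kept_majority, minority_idx])
    return X[keep], y[keep]
```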

2.2 Cost-Sensitive Learning

Cost-sensitive learning, unlike cost-insensitive learning, takes misclassification costs into account [17] and imposes different penalties for different misclassification errors. It aims to classify samples into a set of known classes with high accuracy. Cost-sensitive learning is a common approach for addressing imbalanced datasets.

2.2.1 Cost-Sensitive SVMs

SVMs have proven to be effective in many practical applications. However, they have limitations when applied to learning from imbalanced datasets. A cost-sensitive SVM, which assigns different misclassification costs, is a good solution to this problem [3, 13]. Such an SVM is developed using different error costs for the positive and negative classes, and can improve the classification accuracy for classes with a small number of samples.
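As a hedged sketch, scikit-learn's SVC exposes per-class misclassification costs through its class_weight parameter; the cost ratio below is purely illustrative and is not the setting used in [3, 13] or in this study:

```python
from sklearn.svm import SVC

# Illustrative assumption: misclassifying an outlier (class 1) costs ten times
# more than misclassifying a normal value (class 0).
cost_sensitive_svm = SVC(kernel="rbf", class_weight={0: 1.0, 1: 10.0})
# cost_sensitive_svm.fit(X_train, y_train)
# y_pred = cost_sensitive_svm.predict(X_test)
```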

2.3 Our Contribution

The approach described above achieves a certain level of classification accuracy for some imbalanced datasets; however, it cannot improve the accuracy for highly imbalanced datasets. Therefore, in this research, we propose a rebalancing method using k-means cluster-based undersampling.

3 Datasets and Outliers

3.1 Original Datasets

The classification methods classify the anticipated total number of errors of a project as either a normal value or an outlier, using data from a large software company. The data consist of the following features:

  • \( Class \): This indicates whether the total number of errors for an entire project is a normal value or an outlier. Predicting this value is the objective of the classification.

  • Volume of newly added steps (\( V_{new} \)): This feature denotes the number of steps in the newly generated functions of the target project.

  • Volume of modification (\( V_{modify} \)): This feature denotes the number of steps modified in or added to existing functions that were needed for the target project.

  • Volume of the original project (\( V_{survey} \)): This feature denotes the original number of steps in the modified functions and the number of steps deleted from the functions.

  • Volume of reuse (\( V_{reuse} \)): This feature denotes the number of steps in functions whose external specifications alone were confirmed and which were applied to the target project design without confirming their internal content.

3.2 Determination of Outliers

This study examined the classification of outliers in terms of the number of errors in a project. Fig. 1 shows the distribution of the number of errors, and Fig. 2 is a boxplot of this metric. The lowest datum of the boxplot is 0, the lowest possible number of errors in the projects, which lies within 1.5 times the interquartile range (IQR) below the lower quartile. The highest valid datum is the largest value within 1.5 times the IQR above the upper quartile; values beyond it are outliers and are denoted by circles. In Fig. 2, the values are spread along the Y-axis to present the distribution of the outliers more clearly; the Y-coordinate has no other meaning. Of the 1,419 data points, 143 are outliers. Detailed values of the boxplot are listed in Table 1.
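A minimal sketch of this boxplot rule, assuming errors is a NumPy array of per-project error counts (our naming, not from the original study):

```python
import numpy as np

def boxplot_outliers(errors):
    """Flag values above Q3 + 1.5 * IQR as outliers (the lower fence falls below 0 here)."""
    q1, q3 = np.percentile(errors, [25, 75])
    iqr = q3 - q1
    upper_fence = q3 + 1.5 * iqr
    return errors > upper_fence  # Boolean mask: True for outliers
```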

Fig. 1 Distribution of the total number of errors (in intervals of 500 errors)

Fig. 2 Boxplot of the number of errors

Table 1 Detailed information of the boxplot shown in Fig. 2

4 Classification Methods

The following classification methods were created to compare their accuracy:

  • SVM without undersampling (SVM w/o).

  • Cost-sensitive SVM without undersampling (CSSVM).

  • SVM with random undersampling (W/Random).

  • SVM with k-means cluster-based undersampling (W/n Clusters), where n is the number of clusters, which is varied from 2 to 15.

4.1 K-Means Cluster-Based Undersampling

The k-means clustering algorithm aims to find k cluster centers that minimize the distance from each data point to its assigned cluster. The algorithm assigns each sample to the nearest cluster by distance. The main steps of k-means are to select the initial cluster centers, reassign the data points based on Euclidean distance, and adjust the cluster centers according to the resulting assignment. The clustering result depends strongly on the initial cluster assignment. In this research, the clustering algorithm is applied to undersampling; the k-means cluster-based undersampling procedure is shown in Algorithm 1.

Algorithm 1 K-means cluster-based undersampling
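Algorithm 1 itself is not reproduced here; the following sketch is one common reading of k-means cluster-based undersampling, in which the majority class is clustered and samples are drawn from each cluster in proportion to its size until the retained majority set matches the minority size. The function name, the proportional sampling rule, and the scikit-learn calls are our assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_undersample(X, y, n_clusters, majority_label=0, random_state=0):
    """Cluster the majority class and sample from each cluster in proportion to its size."""
    rng = np.random.default_rng(random_state)
    majority_idx = np.where(y == majority_label)[0]
    minority_idx = np.where(y != majority_label)[0]
    target = len(minority_idx)

    # Partition the majority class into n_clusters groups.
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state).fit_predict(X[majority_idx])

    kept = []
    for c in range(n_clusters):
        cluster_idx = majority_idx[labels == c]
        # Number of samples retained from this cluster, proportional to its size.
        n_keep = min(len(cluster_idx),
                     max(1, round(target * len(cluster_idx) / len(majority_idx))))
        kept.extend(rng.choice(cluster_idx, size=n_keep, replace=False))

    keep = np.concatenate([np.array(kept, dtype=int), minority_idx])
    return X[keep], y[keep]
```

Sampling per cluster rather than over the whole majority class keeps the retained majority samples spread across the feature space instead of concentrated in a few dense regions.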

5 Evaluation Experiment

5.1 Data Used in the Evaluation Experiment

To evaluate the performance of the proposed technique, we performed ten-fold cross-validation on data from 1,419 real projects. The original data were randomly partitioned into 10 equally sized subsamples (each subsample having data from 141 or 142 projects). One of the subsamples was used as the validation data for testing the model, while the remaining nine subsamples were used as training data. The cross-validation process was repeated 10 times, with each of the 10 subsamples used exactly once as validation data. An example of ten-fold cross-validation is shown in Fig. 3.
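A minimal sketch of the ten-fold procedure (our illustration, assuming scikit-learn and that outliers are labeled 1 and normal values 0; any partitioning details beyond those stated above are assumptions):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC

def ten_fold_counts(X, y, random_state=0):
    """Aggregate confusion-matrix counts over ten-fold cross-validation."""
    tp = tn = fp = fn = 0
    kf = KFold(n_splits=10, shuffle=True, random_state=random_state)
    for train_idx, test_idx in kf.split(X):
        # Any rebalancing (e.g., undersampling) is applied to the training fold only.
        model = SVC(kernel="rbf").fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        true = y[test_idx]
        tp += int(np.sum((pred == 1) & (true == 1)))
        tn += int(np.sum((pred == 0) & (true == 0)))
        fp += int(np.sum((pred == 1) & (true == 0)))
        fn += int(np.sum((pred == 0) & (true == 1)))
    return tp, tn, fp, fn
```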

Fig. 3 Ten-fold cross-validation

5.2 Evaluation Criteria

This study focused on the imbalance problem wherein the minority class (outliers) has much lower precision and recall than the majority class (normal values). Accuracy metrics place more weight on the majority class than on the minority class, which makes it difficult for a classifier to perform well on the minority class.

By convention, the class label of the minority class is positive, whereas that of the majority class is negative. The True Positive (\( TP \)) and True Negative (\( TN \)) values, as summarized in Table 2, denote the number of positive and negative samples that are correctly classified, while the False Positive (\( FP \)) and False Negative (\( FN \)) values denote the numbers of samples incorrectly classified as positive and negative, respectively.

Table 2 Confusion matrix

The following eight criteria were used as performance measures for the classification methods. The best value of these criteria is 1.0, whereas the worst is 0.0.

(1):

Accuracy (\( ACC \)) is the proportion of correct predictions among all predictions. It is calculated as the number of correct predictions divided by the total number of samples using the following equation:

$$\begin{aligned} ACC = \frac{ TP + TN }{ TP + TN + FP + FN }. \end{aligned}$$
(1)
(2):

Precision (\( PREC \)) is the proportion of correctly predicted positive cases. It is calculated as the number of accurate positive predictions divided by the total number of positive predictions using the following equation:

$$\begin{aligned} PREC = \frac{ TP }{ TP + FP }. \end{aligned}$$
(2)
(3):

Sensitivity (\( SN \), recall, or \( TP \) rate) is the proportion of positive cases that are correctly identified. It is calculated as the number of correct positive predictions divided by the total number of positives using the following equation:

$$\begin{aligned} SN = \frac{ TP }{ TP + FN }. \end{aligned}$$
(3)
(4):

Specificity (\( SP \) or \( TN \) rate) is defined as the proportion of negative cases that are correctly classified. It is calculated as the number of correct negative predictions divided by the total number of negatives using the following equation:

$$\begin{aligned} SP = \frac{ TN }{ TN + FP }. \end{aligned}$$
(4)
(5, 6, 7):

The F-measure (\( F _{\beta }\)) is the weighted harmonic mean of precision and sensitivity, with weight \(\beta \), calculated as follows:

$$\begin{aligned} F_{\beta } = \frac{(1 + \beta ^2) \times PREC \times SN }{\beta ^2 \times PREC + SN }. \end{aligned}$$
(5)

F-measures \(F_{0.5}\), \(F_1\), and \(F_2\) are commonly used. The larger \(\beta \), the more importance sensitivity has in the equation.

(8):

The G-measure (\( G \)) is based on the sensitivity of both the positive and negative classes, calculated as follows:

$$\begin{aligned} G = \sqrt{ SN \times SP }. \end{aligned}$$
(6)

This paper aimed to detect all outliers; however, there is a trade-off between precision and sensitivity. In general, precision improves at the expense of sensitivity and sensitivity improves at the expense of precision [4]. Thus, it was important to improve \( SN \), \(F_2\) and \( G \) while keeping the other classification performance metrics high.
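The criteria above follow directly from the confusion-matrix counts; a minimal helper (ours, not from the paper) that computes Eqs. (1)-(6):

```python
import math

def classification_criteria(tp, tn, fp, fn, beta=2.0):
    """Compute ACC, PREC, SN, SP, F_beta, and G from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    f_beta = ((1 + beta**2) * prec * sn / (beta**2 * prec + sn)
              if (prec + sn) else 0.0)
    g = math.sqrt(sn * sp)
    return {"ACC": acc, "PREC": prec, "SN": sn, "SP": sp,
            f"F{beta}": f_beta, "G": g}
```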

5.3 Results and Discussion

For each method described in Sect. 4, the confusion matrices of the experimental results for all projects under ten-fold cross-validation are presented in Tables 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18 and 19. The values in the tables are aggregated over the 10 experiments. The results in the tables are summarized in Figs. 4, 5, 6, and 7. High values in Figs. 4 and 5 indicate high accuracy; in contrast, low values in Figs. 6 and 7 indicate high accuracy.

Table 20 summarizes the results of the criteria for all methods. The underlined values indicate the best results, and the top five methods for each criterion are shown in bold. The results of the criteria for SVM w/o, CSSVM, W/Random and W/14 Clusters appear in Fig. 8; the remaining W/n Clusters methods are omitted because they show similar characteristics.

Table 3 Classification results obtained using the support vector machine without undersampling (SVM w/o) Method
Table 4 Classification results obtained using the cost-sensitive SVM without undersampling (CSSVM) Method
Table 5 Classification results obtained using the SVM with random undersampling (W/Random) Method
Table 6 Classification results obtained using the W/2 clusters method
Table 7 Classification results obtained using the W/3 clusters method
Table 8 Classification results obtained using the W/4 clusters method
Table 9 Classification results obtained using the W/5 clusters method
Table 10 Classification results obtained using the W/6 clusters method
Table 11 Classification results obtained using the W/7 clusters method
Table 12 Classification results obtained using the W/8 clusters method
Table 13 Classification results obtained using the W/9 clusters method
Table 14 Classification results obtained using the W/10 clusters method
Table 15 Classification results obtained using the W/11 clusters method
Table 16 Classification results obtained using the W/12 clusters method
Table 17 Classification results obtained using the W/13 clusters method
Table 18 Classification results obtained using the W/14 clusters method
Table 19 Classification results obtained using the W/15 clusters method
Table 20 Accuracy comparison for all methods

The SVM w/o method shows the best \( ACC \); however, it is the second worst with respect to \( SN \). This is because most outliers are classified into the class of normal values (the majority class), as shown in Table 3, which is a common problem with imbalanced datasets.

Fig. 4 Number of \(\textit{TP}\) for each method

Fig. 5 Number of \(\textit{TN}\) for each method

Fig. 6 Number of \(\textit{FP}\) for each method

Fig. 7 Number of \(\textit{FN}\) for each method

The CSSVM method obtains a perfect result for \( PREC \); however, it obtains the worst result for \( SN \). This is because the method imposes heavier costs for misclassifying outliers, which reduces \( FP \) but at the same time increases \( FN \). The results indicate that CSSVM can accurately detect some outliers but overlooks most of them. In other words, the outliers predicted by CSSVM must be actual outliers; however, only a few outliers are detected.

The methods with undersampling show similar behavior: they have higher \( SN \), \(F_2\) and \( G \) but lower \( PREC \) than the methods without undersampling. The W/Random method obtains higher \( PREC \) than the methods with k-means cluster-based undersampling, whereas the W/n Clusters methods obtain better results in terms of \( SN \). In addition, the classification accuracy depends on the number of clusters, and all results of the W/14 Clusters method are within the top five. These results show that the proposed methods can improve the accuracy of detecting outliers; however, they tend to classify too many samples as outliers.

Fig. 8 Accuracy comparison of four methods

6 Conclusion

This research examined the ability of undersampling to detect outliers in terms of the number of errors in embedded software development projects. The undersampling method was based on the k-means clustering algorithm and aimed to improve the proportion of outliers that were correctly identified while keeping the other classification performance metrics high.

Evaluation experiments were conducted to compare the prediction accuracy of the methods with k-means undersampling, random undersampling and without undersampling using ten-fold cross-validation.

The results indicated that the methods with undersampling have higher sensitivity and lower precision than those of the methods without undersampling. The results further indicated that the proposed methods improved the accuracy of detecting outliers but classified too many samples as outliers.

In future research, we plan to investigate the following:

  1. We plan to apply an oversampling method to improve precision while keeping sensitivity high.

  2. We intend to consider other methods to detect outliers.

  3. More data are needed to further support our research. In particular, data for projects that include outliers are essential for improving the models.