1 Introduction

The growth and expansion of our information-based society has resulted in an increasing number of information products, and the functionality of these products is becoming ever more complex [6, 14]. Guaranteeing the quality of software is particularly important because it relates to reliability. It is therefore increasingly important for corporations that develop embedded software to implement efficient processes while guaranteeing timely delivery, high quality, and low development costs [2, 12, 15, 16, 18, 19, 20, 21]. Companies and divisions involved in the development of such software focus on a variety of improvements, particularly in their processes. For new software projects, estimating the number of errors and the amount of effort is essential: the number of errors is directly related to product quality, and the amount of effort is directly related to cost, both of which affect the reputation of the corporation.

Previously, we investigated the estimation of total errors and effort using an artificial neural network (ANN) and showed that ANN models are superior to regression analysis models for estimating errors and effort in new projects [8, 9]. We also proposed a method to estimate intervals for the amount of effort using a support vector machine (SVM) and an ANN [7, 10]. These models were constructed with data that excluded outliers. In practice, however, outliers can only be identified once a project has been completed; hence, they should not be excluded while constructing models and estimating effort. We therefore attempted to classify embedded software development projects by whether the amount of effort was an outlier, using an ANN and an SVM [11]. However, the accuracy of the classifications was not acceptable because of the small number of outliers.

This problem, referred to as data imbalance, occurs in most machine learning methods and exists in a broad range of experimental data [1, 22]. Data imbalance occurs when one of the classes in a dataset has very few samples compared to the other classes. When the number of instances of the majority class greatly exceeds that of the minority class, most samples are classified into the majority class; consequently, the few outliers are classified as normal values. To avoid this problem, we explored rebalancing methods for the error data using k-means [5] cluster-based undersampling. Evaluation experiments were performed to compare the classification accuracy of k-means undersampling with that of random undersampling and no undersampling using ten-fold cross-validation.

2 Related Work

2.1 Undersampling

Undersampling is one of the most common and straightforward strategies for handling imbalanced datasets. Samples of the majority class are dropped to obtain a balanced dataset. Simple undersampling randomly drops samples to generate a balanced dataset.
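As a minimal illustration (the function and variable names below are ours, not from any particular library), simple random undersampling can be sketched in Python as follows:

```python
import numpy as np

def random_undersample(X, y, majority_label=0, random_state=0):
    """Randomly drop majority-class samples until both classes have the same size."""
    rng = np.random.default_rng(random_state)
    majority_idx = np.where(y == majority_label)[0]
    minority_idx = np.where(y != majority_label)[0]
    # Keep only as many majority samples as there are minority samples.
    kept_majority = rng.choice(majority_idx, size=len(minority_idx), replace=False)
    keep = np.concatenate([kept_majority, minority_idx])
    return X[keep], y[keep]
```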

2.2 Cost-Sensitive Learning

Cost-sensitive learning, unlike cost-insensitive learning, takes misclassification costs into account [17] and imposes different penalties for different misclassification errors. It aims to classify samples into a set of known classes with high accuracy. Cost-sensitive learning is a common approach for addressing imbalanced datasets.

2.2.1 Cost-Sensitive SVMs

SVMs have proven to be effective in many practical applications. However, they have limitations when applied to learning from imbalanced datasets. A cost-sensitive SVM, which assigns different misclassification costs, is a good solution to this problem [3, 13]. Such an SVM is developed using different error costs for the positive and negative classes, and can improve the classification accuracy for classes with a small number of samples.
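As a hedged sketch, scikit-learn's SVC exposes per-class misclassification costs through its class_weight parameter; the cost ratio below is purely illustrative and is not the setting used in [3, 13] or in this study:

```python
from sklearn.svm import SVC

# Illustrative assumption: misclassifying an outlier (class 1) costs ten times
# more than misclassifying a normal value (class 0).
cost_sensitive_svm = SVC(kernel="rbf", class_weight={0: 1.0, 1: 10.0})
# cost_sensitive_svm.fit(X_train, y_train)
# y_pred = cost_sensitive_svm.predict(X_test)
```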

2.3 Our Contribution

The approach described above achieves a certain level of classification accuracy for some imbalanced datasets; however, it cannot improve the accuracy for highly imbalanced datasets. Therefore, in this research, we propose a rebalancing method using k-means cluster-based undersampling.

3 Datasets and Outliers

3.1 Original Datasets

The classification methods classify the anticipated total number of errors of a project as either a normal value or an outlier, using data from a large software company. The data consist of the following features:

  • \( Class \): This indicates whether the total number of errors for an entire project is a normal value or an outlier. Predicting this value is the objective of the classification.

  • Volume of newly added steps (\( V_{new} \)): This feature denotes the number of steps in the newly generated functions of the target project.

  • Volume of modification (\( V_{modify} \)): This feature denotes the number of steps modified in or added to existing functions that were needed for the target project.

  • Volume of the original project (\( V_{survey} \)): This feature denotes the original number of steps in the modified functions and the number of steps deleted from the functions.

  • Volume of reuse (\( V_{reuse} \)): This feature denotes the number of steps in functions whose external specifications alone were confirmed and which were applied to the target project design without confirming their internal content.

3.2 Determination of Outliers

This study examined the classification of outliers in terms of the number of errors in a project. Fig. 1 shows the distribution of the number of errors, and Fig. 2 is a boxplot of this metric. The lowest datum of the boxplot is 0, the lowest possible number of errors in the projects, which lies within 1.5 times the interquartile range (IQR) below the lower quartile. The highest valid datum is the largest value within 1.5 times the IQR above the upper quartile; values beyond it are outliers and are denoted by circles. In Fig. 2, the values are spread along the Y-axis to present the distribution of the outliers more clearly; the Y-coordinate has no other meaning. Of the 1,419 data points, 143 are outliers. Detailed values of the boxplot are listed in Table 1.
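A minimal sketch of this boxplot rule, assuming errors is a NumPy array of per-project error counts (our naming, not from the original study):

```python
import numpy as np

def boxplot_outliers(errors):
    """Flag values above Q3 + 1.5 * IQR as outliers (the lower fence falls below 0 here)."""
    q1, q3 = np.percentile(errors, [25, 75])
    iqr = q3 - q1
    upper_fence = q3 + 1.5 * iqr
    return errors > upper_fence  # Boolean mask: True for outliers
```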

Fig. 1 Distribution of the total number of errors (in intervals of 500 errors)

Fig. 2 Boxplot of the number of errors

Table 1 Detailed information of the boxplot shown in Fig. 2

4 Classification Methods

The following classification methods were created to compare their accuracy:

  • SVM without undersampling (SVM w/o).

  • Cost-sensitive SVM without undersampling (CSSVM).

  • SVM with random undersampling (W/Random).

  • SVM with k-means cluster-based undersampling (W/n Clusters), where n is the number of clusters, which is varied from 2 to 15.

4.1 K-Means Cluster-Based Undersampling

The k-means clustering algorithm aims to find k cluster centers that minimize the distance from each data point to its assigned cluster. The algorithm assigns each sample to the nearest cluster by distance. The main steps of k-means are to select the initial cluster centers, reassign the data points based on Euclidean distance, and adjust the cluster centers according to the resulting assignment. The clustering result depends strongly on the initial cluster assignment. In this research, the clustering algorithm is applied to undersampling; the k-means cluster-based undersampling procedure is shown in Algorithm 1.

Algorithm 1 K-means cluster-based undersampling
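Algorithm 1 itself is not reproduced here; the following sketch is one common reading of k-means cluster-based undersampling, in which the majority class is clustered and samples are drawn from each cluster in proportion to its size until the retained majority set matches the minority size. The function name, the proportional sampling rule, and the scikit-learn calls are our assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_undersample(X, y, n_clusters, majority_label=0, random_state=0):
    """Cluster the majority class and sample from each cluster in proportion to its size."""
    rng = np.random.default_rng(random_state)
    majority_idx = np.where(y == majority_label)[0]
    minority_idx = np.where(y != majority_label)[0]
    target = len(minority_idx)

    # Partition the majority class into n_clusters groups.
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state).fit_predict(X[majority_idx])

    kept = []
    for c in range(n_clusters):
        cluster_idx = majority_idx[labels == c]
        # Number of samples retained from this cluster, proportional to its size.
        n_keep = min(len(cluster_idx),
                     max(1, round(target * len(cluster_idx) / len(majority_idx))))
        kept.extend(rng.choice(cluster_idx, size=n_keep, replace=False))

    keep = np.concatenate([np.array(kept, dtype=int), minority_idx])
    return X[keep], y[keep]
```

Sampling per cluster rather than over the whole majority class keeps the retained majority samples spread across the feature space instead of concentrated in a few dense regions.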

5 Evaluation Experiment

5.1 Data Used in the Evaluation Experiment

To evaluate the performance of the proposed technique, we performed ten-fold cross-validation on data from 1,419 real projects. The original data were randomly partitioned into 10 equally sized subsamples (each subsample having data from 141 or 142 projects). One of the subsamples was used as the validation data for testing the model, while the remaining nine subsamples were used as training data. The cross-validation process was repeated 10 times, with each of the 10 subsamples used exactly once as validation data. An example of ten-fold cross-validation is shown in Fig. 3.
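A minimal sketch of the ten-fold procedure (our illustration, assuming scikit-learn and that outliers are labeled 1 and normal values 0; any partitioning details beyond those stated above are assumptions):

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.svm import SVC

def ten_fold_counts(X, y, random_state=0):
    """Aggregate confusion-matrix counts over ten-fold cross-validation."""
    tp = tn = fp = fn = 0
    kf = KFold(n_splits=10, shuffle=True, random_state=random_state)
    for train_idx, test_idx in kf.split(X):
        # Any rebalancing (e.g., undersampling) is applied to the training fold only.
        model = SVC(kernel="rbf").fit(X[train_idx], y[train_idx])
        pred = model.predict(X[test_idx])
        true = y[test_idx]
        tp += int(np.sum((pred == 1) & (true == 1)))
        tn += int(np.sum((pred == 0) & (true == 0)))
        fp += int(np.sum((pred == 1) & (true == 0)))
        fn += int(np.sum((pred == 0) & (true == 1)))
    return tp, tn, fp, fn
```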

Fig. 3 Ten-fold cross-validation

5.2 Evaluation Criteria

This study focused on the imbalance problem wherein the minority class (outliers) has much lower precision and recall than the majority class (normal values). Accuracy metrics place more weight on the majority class than on the minority class, which makes it difficult for a classifier to perform well on the minority class.

By convention, the class label of the minority class is positive, whereas that of the majority class is negative. The True Positive (\( TP \)) and True Negative (\( TN \)) values, as summarized in Table 2, denote the number of positive and negative samples that are correctly classified, while the False Positive (\( FP \)) and False Negative (\( FN \)) values denote the numbers of samples incorrectly classified as positive and negative, respectively.

Table 2 Confusion matrix

The following eight criteria were used as performance measures for the classification methods. The best value of these criteria is 1.0, whereas the worst is 0.0.

(1):

Accuracy (\( ACC \)) is the proportion of correct predictions among all predictions. It is calculated as the number of correct predictions divided by the total number of samples using the following equation:

$$\begin{aligned} ACC = \frac{ TP + TN }{ TP + TN + FP + FN }. \end{aligned}$$
(1)
(2):

Precision (\( PREC \)) is the proportion of correctly predicted positive cases. It is calculated as the number of accurate positive predictions divided by the total number of positive predictions using the following equation:

$$\begin{aligned} PREC = \frac{ TP }{ TP + FP }. \end{aligned}$$
(2)
(3):

Sensitivity (\( SN \), recall, or \( TP \) rate) is the proportion of positive cases that are correctly identified. It is calculated as the number of correct positive predictions divided by the total number of positives using the following equation:

$$\begin{aligned} SN = \frac{ TP }{ TP + FN }. \end{aligned}$$
(3)
(4):

Specificity (\( SP \) or \( TN \) rate) is defined as the proportion of negative cases that are correctly classified. It is calculated as the number of correct negative predictions divided by the total number of negatives using the following equation:

$$\begin{aligned} SP = \frac{ TN }{ TN + FP }. \end{aligned}$$
(4)
(5, 6, 7):

The F-measure (\( F _{\beta }\)) is the weighted harmonic mean of precision and sensitivity, with weight \(\beta \), calculated as follows:

$$\begin{aligned} F_{\beta } = \frac{(1 + \beta ^2) \times PREC \times SN }{\beta ^2 \times PREC + SN }. \end{aligned}$$
(5)

F-measures \(F_{0.5}\), \(F_1\), and \(F_2\) are commonly used. The larger \(\beta \), the more importance sensitivity has in the equation.

(8):

The G-measure (\( G \)) is based on the sensitivity of both the positive and negative classes, calculated as follows:

$$\begin{aligned} G = \sqrt{ SN \times SP }. \end{aligned}$$
(6)

This paper aimed to detect all outliers; however, there is a trade-off between precision and sensitivity. In general, precision improves at the expense of sensitivity and sensitivity improves at the expense of precision [4]. Thus, it was important to improve \( SN \), \(F_2\) and \( G \) while keeping the other classification performance metrics high.
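The criteria above follow directly from the confusion-matrix counts; a minimal helper (ours, not from the paper) that computes Eqs. (1)-(6):

```python
import math

def classification_criteria(tp, tn, fp, fn, beta=2.0):
    """Compute ACC, PREC, SN, SP, F_beta, and G from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    prec = tp / (tp + fp) if (tp + fp) else 0.0
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    f_beta = ((1 + beta**2) * prec * sn / (beta**2 * prec + sn)
              if (prec + sn) else 0.0)
    g = math.sqrt(sn * sp)
    return {"ACC": acc, "PREC": prec, "SN": sn, "SP": sp,
            f"F{beta}": f_beta, "G": g}
```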

5.3 Results and Discussion

For each method described in Sect. 4, the confusion matrices of the experimental results for all projects under ten-fold cross-validation are presented in Tables 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18 and 19. The values in the tables are aggregated over the 10 experiments. The results in the tables are summarized in Figs. 4, 5, 6, and 7. High values in Figs. 4 and 5 indicate high accuracy; in contrast, low values in Figs. 6 and 7 indicate high accuracy.

Table 20 summarizes the results of the criteria for all methods. The underlined values indicate the best results, and the top five methods for each criterion are shown in bold. The results of the criteria for SVM w/o, CSSVM, W/Random and W/14 Clusters appear in Fig. 8; the remaining W/n Clusters methods are omitted because they show similar characteristics.

Table 3 Classification results obtained using the support vector machine without undersampling (SVM w/o) Method
Table 4 Classification results obtained using the cost-sensitive SVM without undersampling (CSSVM) Method
Table 5 Classification results obtained using the SVM with random undersampling (W/Random) Method
Table 6 Classification results obtained using the W/2 clusters method
Table 7 Classification results obtained using the W/3 clusters method
Table 8 Classification results obtained using the W/4 clusters method
Table 9 Classification results obtained using the W/5 clusters method
Table 10 Classification results obtained using the W/6 clusters method
Table 11 Classification results obtained using the W/7 clusters method
Table 12 Classification results obtained using the W/8 clusters method
Table 13 Classification results obtained using the W/9 clusters method
Table 14 Classification results obtained using the W/10 clusters method
Table 15 Classification results obtained using the W/11 clusters method
Table 16 Classification results obtained using the W/12 clusters method
Table 17 Classification results obtained using the W/13 clusters method
Table 18 Classification results obtained using the W/14 clusters method
Table 19 Classification results obtained using the W/15 clusters method
Table 20 Accuracy comparison for all methods

The SVM w/o method shows the best \( ACC \); however, it is the second worst with respect to \( SN \). This is because most outliers are classified into the class of normal values (the majority class), as shown in Table 3, which is a common problem with imbalanced datasets.

Fig. 4 Number of \(\textit{TP}\) for each method

Fig. 5 Number of \(\textit{TN}\) for each method

Fig. 6 Number of \(\textit{FP}\) for each method

Fig. 7 Number of \(\textit{FN}\) for each method

The CSSVM method obtains a perfect result for \( PREC \); however, it obtains the worst result for \( SN \). This is because the method imposes heavier costs for misclassifying outliers, which reduces \( FP \) but at the same time increases \( FN \). The results indicate that CSSVM can accurately detect some outliers but overlooks most of them. In other words, the outliers predicted by CSSVM must be actual outliers; however, only a few outliers are detected.

The methods with undersampling show similar behavior: they have higher \( SN \), \(F_2\) and \( G \) but lower \( PREC \) than the methods without undersampling. The W/Random method obtains higher \( PREC \) than the methods with k-means cluster-based undersampling, whereas the W/n Clusters methods obtain better results in terms of \( SN \). In addition, the classification accuracy depends on the number of clusters, and all results of the W/14 Clusters method are within the top five. These results show that the proposed methods can improve the accuracy of detecting outliers; however, they tend to classify too many samples as outliers.

Fig. 8 Accuracy comparison of four methods

6 Conclusion

This research examined the ability of undersampling to detect outliers in terms of the number of errors in embedded software development projects. The undersampling method was based on the k-means clustering algorithm and aimed to improve the proportion of outliers that were correctly identified while keeping the other classification performance metrics high.

Evaluation experiments were conducted to compare the prediction accuracy of the methods with k-means undersampling, random undersampling and without undersampling using ten-fold cross-validation.

The results indicated that the methods with undersampling have higher sensitivity and lower precision than those of the methods without undersampling. The results further indicated that the proposed methods improved the accuracy of detecting outliers but classified too many samples as outliers.

In future research, we plan to investigate the following:

  1. We plan to apply an oversampling method to improve precision while keeping sensitivity high.

  2. We intend to consider other methods to detect outliers.

  3. More data are needed to further support our research. In particular, data for projects that include outliers are essential for improving the models.