1 Introduction

With frequent changes in software source code to meet the changing and growing requirements of users, a large number of bugs are reported on a daily basis. A bug may be reported by a user, a developer, or any staff member. Reported bugs should be analyzed carefully to determine whether a bug is correct or incorrect, valid or invalid, important or unimportant, and new or duplicate. We assign a priority to a bug so that an important bug is not left untreated for a long time. Correct prioritization is itself a problem: the reporter may not have complete knowledge of the project, which may result in incorrect prioritization. The triager (a person who analyzes, expands and refines the reported bugs) assigns the priority to the bug based on knowledge and experience. Doing this manually is a cumbersome task, so we need to automate the process of bug priority prediction.

Cross project study is gaining ground in software engineering for predicting bugs, cost, bug fix time, severity, priority and other project attributes on the basis of historical data from other projects. In the available literature, very few attempts have been made at cross project study, and very few authors have worked on cross project validation for priority prediction. Recently, Sharma et al. (2012) made an effort to predict the priority of a reported bug in the cross project context, but only for a limited number of cases, and the combination of datasets for building training candidates, which has proven very successful, was not considered. In this paper we attempt to predict the priority of a reported bug using different machine learning techniques and investigate priority prediction in the cross project context by considering combinations of datasets as training candidates. To study and investigate priority prediction in the intra and cross project contexts, we have set the following research questions:

Research Question 1: Which machine learning technique is most appropriate for priority prediction?

Research Question 2: How does the performance of different machine learning techniques vary with the number of terms?

Research Question 3: Does training data from other projects provide acceptable priority prediction results?

Research Question 4: Does a combination of training datasets provide better performance than a single training dataset?

We have organized the empirical evaluation and experimental setup into three scenarios.

Scenario 1: 10 fold cross-validation within the same project dataset

Scenario 2: Cross project validation across different project datasets

Scenario 3: Cross project validation by combining different project datasets as the training candidate

Scenario 1 answers research questions 1 and 2, Scenario 2 answers research question 3, and Scenario 3 answers research question 4.

We have applied Support Vector Machine (SVM), Naïve Bayes (NB), K-Nearest Neighbors (K-NN) and Neural Network classification techniques to the bug repositories of the open source projects Eclipse and OpenOffice. The performance measures accuracy, precision, recall and F-measure have been calculated using 10 fold cross-validation with stratified sampling. The rest of the paper is organized as follows: Section 2 reviews related research work. Section 3 discusses the pre-processing and representation of the textual data (summary) of bug reports. Section 4 describes the datasets and the features selected. Section 5 deals with the experimental setup. Section 6 discusses the results. Section 7 mentions the threats to the validity of the results and, finally, the paper is concluded in Section 8 with future research directions.

2 Related work

Attributes of a bug can be used to predict the priority, severity, fixer and status of the bug. The summary field represents information about the bug (what the bug is about). By mining the summary field and using machine learning techniques, we can predict various attributes for a new bug report. Canfora and Cerulo (2006) proposed a study of how change requests have been assigned to developers involved in open source projects (Mozilla and KDE) and a method to suggest the set of best candidate developers to resolve a new change request. An approach to constructing a recommender for bug assignment has been proposed by Anvik (2006). Anvik et al. (2006) applied SVM, Naive Bayes and Decision Trees on the Eclipse, Firefox and GCC projects for automatic assignment of a developer to a new bug report. Tamrawi et al. (2011) proposed a fuzzy set-based approach for automatic assignment of developers to bug reports. Weiß et al. (2007) presented an approach to predict the fixing effort for an issue, allowing early effort estimation and helping in assigning issues and scheduling stable releases. Kim and Whitehead (2006) computed the bug-fix time of files in ArgoUML and PostgreSQL by identifying when bugs are introduced and when they are fixed. Lamkanfi et al. (2010) investigated whether the severity of a reported bug can be accurately predicted by analyzing its textual description using text mining algorithms. The authors concluded that it is possible to predict severity with reasonable accuracy (both precision and recall vary between 0.65–0.75 for Mozilla and Eclipse and 0.70–0.85 in the case of GNOME). Chaturvedi and Singh (2012) demonstrated the applicability of machine learning algorithms, namely NB, K-Nearest Neighbor, NB Multinomial, SVM, J48 and RIPPER, in determining the severity class of bug report data of NASA from the PROMISE repository. Menzies and Marcus (2008) presented a new, automated method named SEVERIS (SEVERity ISsue assessment) to assist the test engineer in assigning severity levels to defect reports in NASA's Project and Issue Tracking System (PITS). Anvik and Murphy (2011) proposed a recommender for development oriented decisions such as assignment of a developer to a new bug report and prediction of the component for a new bug report. Marks et al. (2011) analyzed different features of bug reports to find the characteristics of bug fix-time using the Mozilla and Eclipse bug repositories. Sharma et al. (2013) predicted the cc-list count of a bug using different regression techniques. Yu et al. (2010) predicted the priority of a bug during the software testing process using an artificial neural network and an NB classifier. Kanwal and Maqbool (2010, 2012) used a classification based approach to compare and evaluate the SVM and NB classifiers for automating the prioritization of new bug reports, using the categorical and textual attributes of bug reports to train the model. They showed that SVM performed better than NB on textual attributes and NB performed better than SVM on categorical attributes, but this analysis was carried out on limited data and techniques. Cross project study remains a new and challenging task.

Our study on cross project priority prediction is motivated by and based on the following papers:

Recently, Sharma et al. (2012) conducted a study to predict bug priority within a project using SVM, K-NN, NB and Neural Network, and found that SVM and Neural Network perform better than K-NN, and K-NN performs better than NB. He et al. (2012) investigated the feasibility of cross-project defect prediction and found that cross-project defect prediction can be better than prediction with a training dataset from the same project. Zimmermann et al. (2009) performed a large scale experiment on data versus domain versus process in cross-project defect prediction. Turhan et al. (2009) performed an empirical study on the relative value of cross-company and within-company data for defect prediction.

3 Preprocessing and representation of data

We have predicted the priority of a bug report based on its summary attribute, entered by the user at the time of bug filing. We pre-processed the bug summaries in the Rapid Miner tool with the following steps:

3.1 Tokenization

Tokenization is the process of breaking a stream of text into words, phrases, symbols, or other meaningful elements called tokens. In this paper, a word or a term has been considered as a token.

3.2 Stop word removal

In the bug summary, prepositions, conjunctions, articles, verbs, nouns, pronouns, adverbs, adjectives, etc. are treated as stop words and have been removed.

3.3 Stemming to base stem

The process of converting derived words to their base form is known as stemming. In this paper, we have used the standard Porter stemming algorithm (Porter 2008) for stemming.

3.4 Feature reduction

Tokens with a minimum length of 3 and a maximum length of 50 characters have been considered, because most data mining algorithms may not be able to handle large feature sets.

3.5 Weight by information gain or infogain

Information gain indicates the importance or relevance of a term or token.
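As an illustration only, the following Python sketch mirrors the preprocessing steps of Sects. 3.1–3.5 outside Rapid Miner. It assumes nltk (with its stopword corpus) and scikit-learn are available, the example summaries and labels are invented, and information gain is approximated here by scikit-learn's mutual information estimate rather than Rapid Miner's operator.

```python
# Illustrative sketch only -- the paper uses Rapid Miner operators for these steps.
import re
from nltk.corpus import stopwords                       # requires nltk.download('stopwords')
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import mutual_info_classif

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def preprocess(summary):
    tokens = re.findall(r"[a-z]+", summary.lower())            # 3.1 tokenization
    tokens = [t for t in tokens if t not in stop_words]        # 3.2 stop word removal
    tokens = [stemmer.stem(t) for t in tokens]                 # 3.3 Porter stemming
    return " ".join(t for t in tokens if 3 <= len(t) <= 50)    # 3.4 length-based reduction

summaries = ["Editor crashes when opening a large file",       # invented toy data
             "Crash on startup after update",
             "Menu rendering is slow on Linux",
             "Slow scrolling in large spreadsheet"]
priorities = ["P2", "P1", "P3", "P3"]

vectorizer = CountVectorizer()
X = vectorizer.fit_transform(preprocess(s) for s in summaries)

# 3.5 weight terms by information gain, approximated here by mutual information
weights = mutual_info_classif(X, priorities, discrete_features=True)
ranked = sorted(zip(vectorizer.get_feature_names_out(), weights),
                key=lambda tw: tw[1], reverse=True)
print(ranked[:5])    # top-ranked terms
```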

To make the textual data structured for analysis, it is represented as a document × term matrix, where the rows correspond to documents (files) and the columns correspond to terms (tokens). The frequency of each term in a document is counted and stored in the matrix. TFi is the number of occurrences of term i in the document; to normalize TFi, it is divided by n, the total number of terms in the document.

TF × IDF stands for "term frequency (TF) times inverse document frequency (IDF)". It determines the importance of a term in the complete dataset or document set. The importance of a term increases with its frequency in the document but is offset by its frequency in the dataset. The inverse document frequency is obtained by dividing the total number of documents by the number of documents containing the term and then taking the logarithm of that quotient. This representation is used to rank the terms and to select the top few terms.

$$ W_i = TF_i \times IDF_i \quad \text{where} \quad IDF_i = \log (N/DF_i) $$

DFi is the document frequency, i.e., the number of documents in which a particular term appears, and N is the total number of documents.
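To make the weighting concrete, the short Python sketch below computes Wi = TFi × IDFi exactly as defined above for a toy set of tokenized summaries (invented for illustration).

```python
# Worked TF-IDF sketch for the weighting defined above (toy, illustrative data).
import math
from collections import Counter

docs = [["crash", "open", "file"],       # tokenized, pre-processed summaries
        ["crash", "save", "file"],
        ["slow", "open", "menu"]]
N = len(docs)                            # total number of documents

# DFi: number of documents in which term i appears
df = Counter(term for doc in docs for term in set(doc))

def weights(doc):
    n = len(doc)                         # total number of terms in the document
    tf = Counter(doc)                    # TFi: occurrences of term i in the document
    return {t: (tf[t] / n) * math.log(N / df[t]) for t in tf}   # Wi = TFi x IDFi

print(weights(docs[0]))   # e.g. 'crash', 'open' and 'file' each get (1/3)*log(3/2)
```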

Different performance measures, namely accuracy, precision, recall and F-measure, can be calculated to measure the performance of a classifier.

The accuracy of a classifier is defined as the proportion of classifications, over all N examples, that were correct. It is the percentage of correct classifications.

The precision of a class A is defined as the number of instances correctly classified as class A divided by the total number of instances classified as class A. It measures the percentage of correct predictions among the predictions made by the classifier.

$$ \text{Precision} = \frac{\text{No. of instances correctly classified as class A}}{\text{Total number of instances classified as class A}} $$

The recall of a class A is the number of instances correctly classified as class A divided by the total number of instances in the dataset having class label A. It measures the percentage of correct predictions relative to the actual class.

$$ \text{Recall} = \frac{\text{No. of instances correctly classified as class A}}{\text{Total no. of instances in the dataset having class label A}} $$

F-measure is calculated to measure the average performance of the classifier while avoiding bias towards either precision or recall. It is the harmonic mean of precision and recall.

$$ \text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$
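The sketch below computes these four measures per priority class with scikit-learn on invented labels and predictions; it illustrates the definitions above and is not the paper's Rapid Miner evaluation.

```python
# Illustrative computation of accuracy, precision, recall and F-measure per class.
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["P3", "P3", "P2", "P3", "P1", "P3", "P4", "P3"]   # invented labels
y_pred = ["P3", "P3", "P3", "P2", "P1", "P3", "P3", "P3"]   # invented predictions

labels = ["P1", "P2", "P3", "P4", "P5"]
precision, recall, f_measure, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=labels, zero_division=0)

print("accuracy =", accuracy_score(y_true, y_pred))
for cls, p, r, f in zip(labels, precision, recall, f_measure):
    print(f"{cls}: precision={p:.2f} recall={r:.2f} f-measure={f:.2f}")
```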

4 Dataset and feature selection

In this paper, we have collected bug reports from the Eclipse and OpenOffice projects. These projects are very popular in terms of their usage. The Eclipse project provides a development platform with extensible frameworks, tools and runtimes for building and managing software throughout its lifetime. In the Eclipse project, we considered 6 products and 49 components. We used only bug reports for the Platform product, which has 21 components, because the others do not have a sufficient number of bug reports. Bug reports were collected up to 13th July 2012 from http://bugs.eclipse.org/bugs/. For OpenOffice, we collected bug reports for the Database Access, Presentation and Spreadsheet products up to 17th July 2012 from http://issues.apache.org/ooo/. We used the summary feature of the bug report to predict the priority of a newly reported bug. Only bug reports with status "resolved", "closed" or "confirmed" and with "fixed" or "duplicate" resolution have been taken, because only these types of bug reports contain meaningful information for building and training the models. Table 1 shows the priority wise distribution of bug reports for the different Eclipse and OpenOffice products.

Table 1 Number of bug reports in different projects
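As a rough sketch of the report selection rule described above, assuming the reports were exported to a CSV file with 'bug_status', 'resolution', 'priority' and 'short_desc' (summary) columns (the file name and column names are assumptions, not the paper's actual export):

```python
# Hypothetical selection of the bug reports used for training (file/column names assumed).
import pandas as pd

reports = pd.read_csv("eclipse_platform_bugs.csv")          # assumed bug-tracker export

selected = reports[
    reports["bug_status"].str.lower().isin(["resolved", "closed", "confirmed"]) &
    reports["resolution"].str.lower().isin(["fixed", "duplicate"])
]
X_text = selected["short_desc"]     # summary feature used for priority prediction
y = selected["priority"]            # P1-P5 priority labels
```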

5 Experimental setup

To conduct the experiment, we built an automated workflow in Rapid Miner (Mierswa et al. 2006) containing steps for the preprocessing of bug reports, model building, cross validation and model testing.
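An equivalent workflow can be sketched with scikit-learn as below; this is not the authors' Rapid Miner process, and any settings beyond those stated in the paper (polynomial SVM kernel of degree 3, K = 5, 100 training cycles) are assumptions. X_text and y are the summaries and priority labels, as in the selection sketch above.

```python
# Sketch of the evaluation loop: vectorize summaries, then stratified 10-fold CV.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

classifiers = {
    "SVM": SVC(kernel="poly", degree=3),           # polynomial kernel, degree 3
    "NB": MultinomialNB(),
    "K-NN": KNeighborsClassifier(n_neighbors=5),   # K = 5
    "NNET": MLPClassifier(max_iter=100),           # roughly 100 training cycles
}

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
for name, clf in classifiers.items():
    model = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
    scores = cross_val_score(model, X_text, y, cv=cv, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f}")
```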

A graphical presentation of the generation of training datasets for the Database Access (DB) product of the OpenOffice project in Scenarios 2 and 3 is shown in Fig. 1.

Fig. 1 Procedure of generating training sets for DB

6 Results and discussion

Scenario 1: We applied different machine learning techniques, namely SVM, Naive Bayes, K-Nearest Neighbors and Neural Network, to predict the priority of reported bugs using 10 fold cross-validation and varying the number of terms from 25 to 200 (Sharma et al. 2012). We tried different kernels and other parameters and then used the parameter values that gave a significant level of performance. For SVM we used a polynomial kernel of degree 3.
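The term-count sweep can be sketched as follows; limiting the vocabulary with max_features is only an approximation of the paper's term ranking, and X_text and y are assumed as above.

```python
# Sketch of varying the number of top terms from 25 to 200 for the SVM classifier.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

for k in range(25, 201, 25):                            # 25, 50, ..., 200 terms
    model = Pipeline([
        ("tfidf", TfidfVectorizer(max_features=k)),     # keep only the top k terms
        ("svm", SVC(kernel="poly", degree=3)),          # polynomial kernel, degree 3
    ])
    acc = cross_val_score(model, X_text, y, cv=10).mean()
    print(f"top {k} terms: accuracy = {acc:.3f}")
```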

Accuracy of SVM and NB has been shown in Table 2.

Table 2 Accuracy measures of SVM and NB classifiers

The above table shows that the accuracy of SVM is more than 74 % for all datasets. It improves slightly or merely fluctuates as the number of terms increases from 25 to 200, whereas NB accuracy increases as we increase the number of terms from 25 to 200. For Eclipse version 2 and its subversions, SVM accuracy increased slightly from 74.66 to 74.76 % as the number of terms varied from 25 to 75; after this it keeps fluctuating by a small fraction. For Eclipse version 3 and its subversions, SVM gives the highest accuracy, i.e., 93.73 %. For the OpenOffice Database Access product, accuracy increased slightly from 77.94 to 78.27 % for 25 to 100 terms and then keeps fluctuating as the number of terms increases. For the OpenOffice Spreadsheet product, accuracy increased slightly from 81.07 to 81.17 % for 25 to 200 terms. For the OpenOffice Presentation product, accuracy increased slightly from 79.74 to 80.26 % for 25 to 175 terms. These results show that there is no significant increase in accuracy with the increase in the number of terms beyond 100 terms.

The accuracy of the NB classifier is not significant. The reason is the large degree of class overlap in our datasets, as the features (terms) used for priority prediction belong to more than one priority class. Denil and Trappenberg (2010) have shown that SVM is not sensitive to the overlapping problem; it is able to achieve performance comparable to the optimal classifier in the presence of overlapping classes.

Accuracy of K-NN for K = 1 to 5 has been shown in Table 3.

Table 3 Accuracy measure (in %) of K-NN classifier with K = 1–5

The accuracy of K-NN increases slightly as the value of K increases from 1 to 5 and decreases slightly as the number of terms increases from 25 to 200. For the Eclipse Platform product version 2, the accuracy for K = 1 is 69.48 %, which decreases to 64.25 % as the number of terms increases from 25 to 200; on the other hand, for increasing values of K, the accuracy increased from 69.48 to 73.98 %. For version 3, the accuracy for K = 1 is 91.70 %, which decreases to 89.61 % as the number of terms increases from 25 to 200; on the other hand, for increasing values of K, the accuracy increased up to 93.70 %. In all cases and for all datasets, the accuracy is above 64 %. The maximum accuracy of the classifier with the present datasets is 93.70 %, obtained for 25 and for the top 200 terms with K = 5. The same trend is followed for the other datasets. This shows that the number of terms does not affect the accuracy to a great extent.

Accuracy of NNET has been shown in Table 4.

Table 4 Accuracy measure (in %) of Neural network for 100–300 training cycles

The accuracy of the Neural Network decreases as the number of training cycles increases from 100 to 300 in almost all datasets. We get the largest number of highest accuracies at 100 training cycles. The accuracies for Eclipse versions 2 and 3 are 74.75 and 93.77 % for 100 training cycles and 100 terms. For the OpenOffice products, the highest accuracies are 79.06, 81.31 and 81.16 %.

In answer to research question 1, we conclude that the performance of the Neural Network in terms of accuracy is better than that of SVM, which is better than K-NN. The accuracy of Naive Bayes is the lowest of all.

A graphical presentation of the accuracy measure for all the classifiers, with the number of terms varying from 25 to 200, across all datasets is shown in Figs. 2, 3, 4, 5 and 6.

Fig. 2 Accuracy measure for Eclipse Ver2

Fig. 3 Accuracy measure for Eclipse Ver3

Fig. 4 Accuracy measure for Database access

Fig. 5 Accuracy measure for Presentation

Fig. 6 Accuracy measure for Spreadsheet

Figure 2 shows that SVM and NNET accuracy remain invariant with an increase in the number of terms, whereas K-NN accuracy goes down slightly. Figure 3 shows that for all three classifiers, SVM, NNET and K-NN, increasing the number of terms has no significant effect on accuracy. From Fig. 4 it is clear that the accuracy of SVM and NNET remains invariant with an increase in the number of terms, whereas K-NN accuracy goes down slightly. Figure 5 shows that the accuracy of NNET goes down as the number of terms increases from 25 to 125, increases at 150 terms and then decreases again as the number of terms increases further. K-NN accuracy goes up at 75 terms and then fluctuates with further increases in the number of terms, while SVM accuracy increases very slowly. From Fig. 6 we see that K-NN and NNET accuracy fluctuate with an increase in the number of terms, while SVM accuracy remains invariant.

The accuracy trends with varying numbers of terms show that the globally distributed users of open source software use a set of technical terms (words) related to the software to report a bug. This set of prominent terms is fixed and is not very large in open source software; for some software it consists of only 25 terms, and adding more terms does not improve accuracy. This suggests that bug reporting follows a systematic approach in open source software.

For the SVM classifier on the Eclipse version 2 dataset, P1 class precision increased from 34.62 to 46.88 % for 25 to 75 terms and after that it starts decreasing for 100 to 200 terms. For the P2 class, precision increases from 14.29 to 26.79 % for 25 to 75 terms, then decreases and again increases to its maximum value, i.e., 27.03 %, at 175 terms. Precision of the P3 class increases from 74.94 to 75.24 % up to 100 terms and then starts decreasing with the increase in the number of terms. Precision increases from 65.45 to 72.55 % up to 75 terms in the case of the P4 class. P5 class precision increases from 0.00 to 54.55 % for 25 to 75 terms, after which it decreases. In the case of the Eclipse version 3 dataset, P1 class precision increased from 0.00 to 73.33 % for 25 to 125 terms and after that it starts decreasing for 150 to 200 terms. For the P2 class, precision increases to 17.65 % at 100 terms and then fluctuates, reaching 18.75 % at 200 terms. Precision of the P3 class increases from 93.81 to 93.90 % up to 100 terms and then starts fluctuating with the increase in the number of terms. Precision increases from 55.88 to 58.33 % up to 100 terms in the case of the P4 class. P5 class precision increases from 0.00 to 50.00 % for 25 to 75 terms, then fluctuates.

In the case of the OpenOffice Database Access dataset, P1 class precision increased from 75.00 to 87.50 % at 75 terms and after that it fluctuates. For the P2 class, the maximum precision is 61.04 % at 75 terms. Precision of the P3 class increases to 78.88 % up to 75 terms. Precision increases to 57.14 % up to 200 terms in the case of the P4 class. P5 class precision increases from 0.00 to 20.00 % at 50 terms. In the case of the OpenOffice Spreadsheet dataset, P1 class precision increased to 62.50 % at 75 terms and after that it fluctuates. For the P2 class, the maximum precision is 73.44 % at 25 terms. Precision of the P3 class increases to 81.51 % up to 200 terms. Precision increases to 83.33 % up to 200 terms in the case of the P4 class. P5 class precision increases to 57.14 % at 175 terms. In the case of the OpenOffice Presentation dataset, the maximum P1 class precision is 83.33 % at 25 terms, after which it fluctuates. For the P2 class, the maximum precision is 69.84 % at 150 terms. Precision of the P3 class increases to 80.88 % up to 50 terms. Precision increases to 66.67 % up to 50 terms in the case of the P4 class. P5 class precision remains 0.00 % for all numbers of terms.

For the 5 priority classes and 5 datasets we have 25 maximum precision values. Of these, for the P5 class in one case the precision remains 0 for all numbers of terms. We obtain 18 cases of maximum precision in the range of 25 to 100 terms and 6 cases in the range of 125 to 200 terms. This indicates that 25 to 100 terms are sufficient to obtain the maximum precision for a class.

In the case of the Eclipse version 2 dataset, P1 class f-measure increased from 1.91 to 6.63 % for 25 to 100 terms and after that it starts decreasing for 125 to 200 terms. For the P2 class, f-measure increases from 0.41 to 2.68 % for 25 to 175 terms. F-measure of the P3 class increases from 85.52 to 85.55 % up to 75 terms and then starts decreasing with the increase in the number of terms. F-measure increases from 16.94 to 17.58 % up to 75 terms in the case of the P4 class. P5 class f-measure increases from 0.00 to 5.74 % for 25 to 100 terms, after which it decreases. In the case of the Eclipse version 3 dataset, P1 class f-measure increased from 0.00 to 6.29 % for 25 to 100 terms and after that it starts decreasing for 125 to 200 terms. For the P2 class, f-measure increases from 0.00 to 1.20 % at 100 terms. F-measure of the P3 class is 96.76 % at 25 terms and then starts fluctuating with the increase in the number of terms. F-measure increases from 10.74 to 14.08 % up to 200 terms in the case of the P4 class. P5 class f-measure increases from 0.00 to 9.34 % for 25 to 100 terms, after which it fluctuates. In the case of the OpenOffice Database Access dataset, P1 class f-measure is 31.25 % at 25 terms and after that it fluctuates. For the P2 class, the maximum f-measure is 17.13 % at 25 terms. F-measure of the P3 class increases to 87.72 % up to 100 terms. F-measure increases to 6.22 % up to 200 terms in the case of the P4 class. P5 class f-measure is 4.65 % at 25 terms. In the case of the OpenOffice Spreadsheet dataset, P1 class f-measure increased to 11.12 % at 75 terms and after that it fluctuates. For the P2 class, the maximum f-measure is 17.96 % at 100 terms. F-measure of the P3 class increases to 89.48 % up to 200 terms. F-measure increases to 9.35 % up to 200 terms in the case of the P4 class. P5 class f-measure increases to 6.61 % at 175 terms. In the case of the OpenOffice Presentation dataset, the maximum P1 class f-measure is 24.66 % at 100 terms, after which it fluctuates. For the P2 class, the maximum f-measure is 28.61 % at 50 terms. F-measure of the P3 class increases to 88.91 % up to 175 terms. F-measure increases to 4.30 % up to 50 terms in the case of the P4 class. P5 class f-measure remains 0.00 % for all numbers of terms.

We obtain 18 cases of maximum f-measure in the range of 25 to 100 terms and 6 cases in the range of 125 to 200 terms. This indicates that 25 to 100 terms are sufficient to obtain the maximum f-measure for a class.

For the K-NN classifier, precision increases with the increase in the value of K from 1 to 5 and decreases with the increase in the number of terms from 25 to 200 across all datasets. F-measure fluctuates with the increase in the number of terms but shows an increasing trend as K increases from 1 to 5. The maximum F-measure for the P3 class is 85.16 % for K = 5 and 25 terms. For K equal to 4 and 5, we obtained the highest values of precision and F-measure with fewer terms.

For the Neural Network classifier, precision increases with the increase in the number of terms in most cases. The highest precision for Eclipse version 2 is 39.29, 36.36, 75.26, 73.47 and 50 % for the P1 to P5 classes. The highest precision for Eclipse version 3 is 40.00, 16.67, 93.86, 62.50 and 0 % for the P1 to P5 classes. Precision for OpenOffice Database Access is 100.00, 64.18, 82.12, 100 and 0 % for the P1 to P5 classes. Precision for OpenOffice Spreadsheet is 63.64, 53.33, 82.60, 100 and 20 % for priority levels P1 to P5. The highest precision for OpenOffice Presentation is 100.00, 60.80, 85.52, 50 and 0 % for priority levels P1 to P5. We get the highest precision, recall and f-measure for the P3 class.

For the NB classifier on the Eclipse version 2 dataset, the maximum P1 class precision is 20.40 % at 50 terms. For the P2 class, precision is 26.68 % at 50 terms. Precision of the P3 class is 89.33 % at 75 terms. Precision is 8.21 % at 25 terms in the case of the P4 class. P5 class precision increases from 2.35 to 2.72 % for 25 to 100 terms, after which it decreases. In the case of the Eclipse version 3 dataset, the maximum P1 class precision is 2.56 % at 200 terms. For the P2 class, precision is 12.48 % at 150 terms. Precision of the P3 class is 98.33 % at 75 terms. Precision is 8.91 % at 25 terms in the case of the P4 class. P5 class precision increases from 0.52 to 0.66 % for 25 to 200 terms. In the case of the OpenOffice Database Access dataset, the maximum P1 class precision is 19.39 % at 25 terms. For the P2 class, precision is 36.94 % at 25 terms. Precision of the P3 class is 93.36 % at 100 terms. Precision is 18.46 % at 100 terms in the case of the P4 class. P5 class precision increases from 1.11 to 1.25 % for 25 to 125 terms. In the case of the OpenOffice Spreadsheet dataset, the maximum P1 class precision is 5.15 % at 25 terms. For the P2 class, precision is 34.08 % at 50 terms. Precision of the P3 class is 89.66 % at 150 terms. Precision is 34.04 % at 25 terms in the case of the P4 class. P5 class precision increases from 2.35 to 3.33 % for 25 to 200 terms. In the case of the OpenOffice Presentation dataset, the maximum P1 class precision is 7.07 % at 50 terms. For the P2 class, precision is 50.00 % at 25 terms. Precision of the P3 class is 91.51 % at 75 terms. Precision is 4.71 % at 75 terms in the case of the P4 class. P5 class precision increases from 1.15 to 1.41 % for 25 to 100 terms.

We obtain 19 cases of maximum precision in the range of 25 to 100 terms and 6 cases in the range of 125 to 200 terms. This indicates that 25 to 100 terms are sufficient to obtain the maximum precision for a class.

For NB, in the case of the Eclipse version 2 dataset, the P1 class f-measure is 24.90 % at 75 terms. For the P2 class, f-measure is 8.48 % at 100 terms. F-measure of the P3 class is 7.74 % at 200 terms. F-measure is 12.72 % at 75 terms in the case of the P4 class. P5 class f-measure increases from 2.35 to 5.25 % for 25 to 100 terms, after which it decreases. In the case of the Eclipse version 3 dataset, the P1 class f-measure is 4.73 % at 200 terms. For the P2 class, f-measure is 11.48 % at 175 terms. F-measure of the P3 class is 11.82 % at 200 terms. F-measure is 13.77 % at 25 terms in the case of the P4 class. P5 class f-measure increases from 0.52 to 1.31 % for 25 to 100 terms, after which it decreases. In the case of the OpenOffice Database Access dataset, the P1 class f-measure is 26.55 % at 25 terms. For the P2 class, f-measure is 24.75 % at 150 terms. F-measure of the P3 class is 19.74 % at 200 terms. F-measure is 23.55 % at 100 terms in the case of the P4 class. P5 class f-measure increases from 1.11 to 2.44 % for 25 to 125 terms, after which it decreases. In the case of the OpenOffice Spreadsheet dataset, the P1 class f-measure is 8.58 % at 25 terms. For the P2 class, f-measure is 31.26 % at 200 terms. F-measure of the P3 class is 9.94 % at 200 terms. F-measure is 19.59 % at 150 terms in the case of the P4 class. P5 class f-measure increases from 2.35 to 6.08 % for 25 to 200 terms. In the case of the OpenOffice Presentation dataset, the P1 class f-measure is 12.10 % at 75 terms. For the P2 class, f-measure is 31.89 % at 100 terms. F-measure of the P3 class is 7.74 % at 200 terms. F-measure is 7.26 % at 100 terms in the case of the P4 class. P5 class f-measure increases from 1.15 to 2.76 % for 25 to 100 terms, after which it decreases.

We obtain 12 cases of maximum f-measure in the range of 25 to 100 terms and 13 cases in the range of 125 to 200 terms. This suggests that, for NB, 125 to 200 terms are needed to obtain the maximum f-measure for a class.

A graphical presentation of the performance of the different classifiers across all datasets on the basis of f-measure is shown in Fig. 7.

Fig. 7 Classifiers performance using f-measure

We counted the maximum F-measure for each priority class for each technique. After summing these values for each technique, we plotted the graph of f-measure. The result shows that the performance in terms of f-measure is 41, 29, 24 and 6 % for SVM, NNET, NB and K-NN, respectively.

In answer to research question 2, we conclude that the top 100 terms are sufficient to obtain optimum performance in terms of accuracy, precision and f-measure across all machine learning techniques. We also found that K = 5 for K-NN and 100 training cycles for the Neural Network give optimum performance.

Scenario 2: From Scenario 1, we concluded that K-NN and NNET give optimum results for K = 5 and 100 training cycles, so we ran Scenario 2 with these values. Table 5 summarizes the accuracy of cross project priority prediction for the 4 different techniques.

Table 5 Accuracy (in %) of cross validated projects

Cross project validation with Eclipse version 2 as the training set and version 3 as the testing set gives 90.38, 92.05 and 93.60 % accuracy for K-NN (K = 5), SVM and Neural Network with 100 terms. It is clear from this empirical evidence that cross validation works well within the same domain. We also found that cross project validation is bidirectional, with accuracy above 72 %.
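A minimal sketch of this cross project setting is given below, assuming train_text/train_y and test_text/test_y hold the preprocessed summaries and priority labels of the training and testing projects (e.g. Eclipse version 2 and version 3); the SVM settings follow Scenario 1.

```python
# Sketch of Scenario 2: train on one project's reports, test on another's.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

vectorizer = TfidfVectorizer(max_features=100)        # top 100 terms, as in Scenario 1
X_train = vectorizer.fit_transform(train_text)        # vocabulary from the training project
X_test = vectorizer.transform(test_text)              # testing project mapped onto it

model = SVC(kernel="poly", degree=3).fit(X_train, train_y)
print("cross project accuracy:", accuracy_score(test_y, model.predict(X_test)))
```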

In answer to research question 3, we conclude that cross project validation across domains gives accuracy above 72 % for SVM, K-NN and Neural Network. We found that cross project validation works well, with significant accuracy, for priority prediction.

The f-measure for the machine learning techniques SVM, K-NN, Neural Network and NB varies in the ranges 85.25 to 96.67, 84.49 to 95.69, 84.97 to 96.76 and 0.83 to 5.21 %, respectively, across all datasets for priority level 3. Due to the insufficient number of reports for priority levels 1, 2, 4 and 5, we do not obtain the desired performance from the different machine learning techniques. This is one of the problems in multi-class prediction.

Scenario 3: Tables 6, 7, 8 and 9 summarize the accuracy of the different training candidates for the studied learning techniques.

Table 6 Accuracy (in %) of SVM model in different datasets
Table 7 Accuracy (in %) of K-NN model in different datasets
Table 8 Accuracy (in %) of NB model in different datasets
Table 9 Accuracy (in %) of NNET model in different datasets

These tables show that the accuracy for the V3 dataset is more than 90 % for all the techniques except the NB classifier, irrespective of the training dataset.
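A sketch of the combined-training setting follows; the per-project lists of summaries and labels (e.g. presentation_text, spreadsheet_text and the Database Access test data) are assumed variables, and the classifier settings again follow Scenario 1.

```python
# Sketch of Scenario 3: combine two projects' reports as one training candidate.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

combined_text = list(presentation_text) + list(spreadsheet_text)
combined_y = list(presentation_y) + list(spreadsheet_y)

vectorizer = TfidfVectorizer(max_features=100)
X_train = vectorizer.fit_transform(combined_text)
X_test = vectorizer.transform(db_access_text)          # target project to be predicted

model = SVC(kernel="poly", degree=3).fit(X_train, combined_y)
print("combined training accuracy:", accuracy_score(db_access_y, model.predict(X_test)))
```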

In answer to research question 4, we conclude that combined training datasets give accuracy above 73 % in all cases, which is not a significant improvement over a single training dataset. With NB we get very low accuracy.

The f-measure for the machine learning techniques SVM, K-NN, Neural Network and NB varies in the ranges 85.08 to 96.68, 84.92 to 96.70, 85.18 to 96.77 and 0.00 to 1.27 %, respectively, across all datasets for priority level 3. Due to the insufficient number of reports for priority levels 1, 2, 4 and 5, we do not obtain the desired performance from the different machine learning techniques. This is one of the problems in multi-class prediction with imbalanced datasets.

7 Threats to validity

The following factors affect the validity of our approach:

7.1 Construct validity

The accuracy of the classifier depends on the summary text; if the summary does not contain appropriate terms to be learned, then the bug will be assigned to a wrongly predicted class. The number of bug reports in the P3 class is larger than in the other classes, which biases the classifier towards the P3 class.

7.2 Internal validity

We have taken only the summary feature of the bug report. Other features could also be considered for prediction.

7.3 External validity

We have considered the Eclipse and OpenOffice projects, which are open source. Closed source software could also be considered.

7.4 Reliability

The Rapid Miner tool (http://www.rapid-i.com/) has been used in this paper for data pre-processing, model building and 10 fold cross validation. The increasing use of the Rapid Miner tool in the data mining community supports the reliability of the tool.

8 Conclusion

In response to Scenario 1 and the subsequent experimental setup, the following conclusions have been drawn for the variation in the number of terms from 25 to 200:

  • SVM, Neural Network and K-NN techniques are applicable for predicting the priority level of a reported bug in open source projects.

  • SVM and Neural Network give overall higher accuracy in comparison to K-NN and Naive Bayes for all datasets.

  • SVM performance in terms of accuracy showed no significant improvement with an increase in the number of terms.

  • SVM performance in terms of precision and f-measure improved slightly as the number of terms increased from 25 to 100. We found fewer cases of improvement in the range of 125 to 200 terms.

  • K-NN performance in terms of accuracy, precision and f-measure decreased slightly as the number of terms increased from 25 to 200 and increased as the value of K increased from 1 to 5.

  • NB performance in terms of accuracy increases with the increase in the number of terms from 25 to 200. Its precision increases in the range of 25 to 100 terms; we found fewer cases of increase in the range of 125 to 200 terms.

  • Neural Network gives higher accuracy at 100 training cycles, and the accuracy decreases with an increase in training cycles in almost all cases. We get the highest precision, recall and f-measure for the P3 priority level.

  • Recall of SVM is high for classes having a large number of reports and low for classes having fewer reports, whereas for Naive Bayes recall is high for classes having fewer reports and low for classes having a large number of reports.

  • The machine learning techniques performed well in terms of precision only for priority level 3, due to the fact that it has a sufficient number of bug reports.

  • The precision values show significant improvement with an increase in the number of reports across all the techniques. Precision increases from 25 to 100 terms, after which it starts decreasing with the increase in the number of terms.

  • Automation of bug triage using priority prediction will save time and resources and will help in solving higher priority bugs within a given time period.

  • We found that SVM and Neural Network are better than K-NN and that K-NN is better than NB.

In response to Scenario 2 and the subsequent experimental setup, the following conclusions have been drawn:

  • The accuracy in the cross project context is better than within the project.

  • Cross project validation for the different cases works with an accuracy level of more than 72 %, except for the NB learner.

Finally, we conclude that historical data of other projects developed in an open source environment is a good priority predictor and that priority prediction in the cross project context works well.

As a result of Scenario 3, we found that combining training datasets from other projects works well but does not show significant improvement over a single training dataset. In the absence of historical data, combined training datasets from other projects provide acceptable performance. In our case we obtained accuracy above 73 % in all cases except for the NB learner.

In the future, we will work on the following agendas:

  • The current empirical study can be carried out on more open source and closed source projects to validate priority prediction in the cross project context.

  • The study can be extended to determine the optimum number of bug reports as well as the optimum number of features/terms required to obtain the best performance.

  • The impact of imbalanced data on the performance of the classifier can be considered.

  • The training data selection method used in this paper leads to an exponential increase in the number of candidate training datasets to be considered to find the best one. We will try to find a similarity measure between training and testing projects using fuzzy logic.