
1 Introduction

Software bug reports are descriptions of defects or errors identified by software testers or users. Duplicate reports arise when the same defect is reported by many users, and these duplicates cost futile effort in identification and handling. Developers, QA personnel, and triagers therefore consider duplicate bug reports a concern, and detecting them is crucial because it reduces triaging effort. The effort needed to identify duplicate reports can be reduced by exploiting the textual similarity between previous issues and a new report [13]. Various approaches have been proposed to automate duplicate bug report detection, including NLP [16], machine learning [1, 10, 20], information retrieval [19], topic analysis [1, 9], deep learning [3], and combinations of models [3, 23]. Figure 1 shows a hierarchy of the most widely used sparse and dense vector semantics [7].

Fig. 1. Vector Representation in NLP

Our study combines sparse and dense vector representation approaches to build amalgamated models for duplicate bug report detection. One or more models from LDA, TF-IDF, GloVe, and Word2Vec are combined to create amalgamated similarity scores, which present the most similar (duplicate) bug reports to the bug triaging team. The proposed models take into account both the textual information (description) and the non-textual information (product and component) of the bug reports. TF-IDF captures relationships between documents [16]; the distributional semantic models, Word2Vec and GloVe, use vectors that keep track of contexts, e.g., co-occurring words.

LDA captures relationships between documents by transforming them into a lower-dimensional topic space. An amalgamated score is computed by merging the similarity scores from the individual approaches; this score forms the basis for the top-k duplicate bug recommendations. The empirical evaluation has been performed on three datasets, namely Apache, Eclipse, and KDE [17], with bug reports as described in Table 1. The effectiveness of the proposed approach is evaluated with three established performance metrics, viz. mean average precision (MAP), recall-rate@k, and mean reciprocal rank (MRR).

This study makes the following contributions:

  • An empirical analysis of amalgamated models to rank duplicate bug reports.

  • An assessment of the effectiveness of amalgamating models.

  • An analysis of the statistical significance and effect size of the proposed models.

The paper is organised into eight sections. The following section describes related work in detail. The third section explains the dataset and the pre-processing steps followed, as given in [19]. The fourth section elaborates the methodology. The fifth section provides insights into the evaluation metrics. The sixth section discusses the results generated by the proposed models. The seventh section presents the threats to validity. The final section concludes the paper and provides directions for future work.

2 Related Work

Extensive research has been conducted on detecting duplicate bug reports automatically, and several methods have been developed in this area [1, 5, 10, 23]. A TF-IDF model has been proposed that represents a bug report as a vector to compute the similarity of textual features [12]. An approach based on n-grams has also been applied for duplicate detection [21]. These methods are primarily term-based and can detect only lexically similar duplicate bug reports. In addition to the textual information in bug reports, researchers have observed that additional features also support the classification or identification of duplicate bug reports.

The first study to combine textual and non-textual features derived from duplicate reports was presented by Jalbert and Weimer [6]. In 2008, Wang et al. [22] combined execution traces with textual information. More recently, software engineering research has shifted its focus towards vector space models (VSMs), and word embedding has become one of the most popular representations of document vocabulary. A method was proposed that uses software dictionaries and word lists to extract the implicit context of each issue report [1].

It has also been shown that Latent Dirichlet Allocation (LDA) has great potential for detecting duplicate bug reports [5, 9]. Zou et al. [24] suggested that a combination of LDA and an n-gram algorithm outperforms state-of-the-art methods. Recently, a deep learning technique for duplicate bug report detection was proposed by Budhiraja et al. [3]. Although many models have been developed in prior research, and a recent trend towards ensembling various models has emerged, no existing work amalgamates statistical, contextual, and semantic models to identify duplicate bug reports.

3 Dataset and Pre-processing

3.1 Dataset

A collection of bug reports that is publicly available for research purposes has been published by Sedat et al. [17]. The repository presents three defect datasets extracted from Bugzilla in “.csv” format [17]. It contains datasets for the open source software projects Apache, Eclipse, and KDE. The datasets contain information about approximately 914 thousand defect reports over a period of 18 years (1999–2017) and capture the inter-relationships among duplicate defects. Descriptive statistics are given in Table 1. The datasets contain two categories of features, viz. textual and non-textual. The textual information is the description of the bug given by the users, i.e., “Short_desc”. The non-textual information is given by the features “Product”, “Component”, “Priority”, “Bug severity”, “Version”, “Bug status”, “current status”, and “duplicate list”. Of these non-textual features, “Product” and “Component” are used as a filter, and “duplicate list” is used to create the ground truth for the evaluation metrics.
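As a minimal sketch of how this ground truth can be derived with pandas, consider the following; the file name and the exact column label for the duplicate list are assumptions based on the description above, not verified against the repository.

```python
# Sketch: load one Bugzilla CSV and split it on the duplicate list.
# "eclipse.csv" and the column label "duplicate list" are assumptions
# taken from the prose above, not verified against the actual files.
import pandas as pd

reports = pd.read_csv("eclipse.csv")  # hypothetical file name

# Reports with a non-empty duplicate list form the evaluation ground truth;
# read_csv maps the literal string "NA" to NaN, the extra check is defensive.
has_duplicates = reports["duplicate list"].notna() & (reports["duplicate list"] != "NA")
queries = reports[has_duplicates]      # test set: reports with known duplicates
resolved = reports[~has_duplicates]    # treated as the resolved corpus

print(len(queries), "query reports;", len(resolved), "resolved reports")
```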

Table 1. Dataset description

3.2 Pre-processing of Textual Features

Pre-processing and term filtering were used to prepare the corpus from the textual features. The sentences, words, and characters identified during pre-processing were converted into tokens, and the corpus was prepared. Corpus preparation included conversion to lower case, word normalisation, elimination of punctuation characters, and lemmatisation.
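A plausible implementation of this pipeline with nltk is sketched below; the paper does not list its exact function calls, so the specific tokenizer and lemmatizer are assumptions.

```python
# One plausible nltk pipeline for the steps above: lower-casing, punctuation
# removal, tokenisation, and lemmatisation. The exact calls are assumptions.
import string
import nltk
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("wordnet", quiet=True)

lemmatizer = WordNetLemmatizer()

def preprocess(description):
    # Lower-case and strip punctuation characters.
    text = description.lower().translate(str.maketrans("", "", string.punctuation))
    # Tokenise and lemmatise each word; returns a list of tokens.
    return [lemmatizer.lemmatize(token) for token in word_tokenize(text)]

corpus = [preprocess(d) for d in ["Crash when opening a project!",
                                  "UI freezes on start-up."]]
```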

4 Methodology

The flowchart shown in Fig. 2 depicts the approach followed in this paper.

Fig. 2.
figure 2

Overall Methodology

4.1 Latent Dirichlet Allocation

The textual information of bug reports is a typical example of unstructured data, as the content is written in natural language, and LDA has emerged as an efficient approach for identifying patterns in unstructured data [18]. In this paper, LDA has been applied to query the corpus and identify latent patterns; the heuristic parameters proposed by Arun et al. [2] and Cao et al. [4] were used to decide the topic count.
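For illustration, a minimal gensim sketch of this step is given below; the topic count shown is a placeholder, since the paper selects it with the heuristics of Arun et al. [2] and Cao et al. [4], which are not reproduced here.

```python
# Minimal gensim LDA sketch. num_topics=20 is a placeholder; the paper picks
# the topic count with the heuristics of Arun et al. [2] and Cao et al. [4].
from gensim import corpora, models, similarities

docs = [["crash", "open", "project"], ["ui", "freeze", "startup"]]  # toy corpus
dictionary = corpora.Dictionary(docs)
bow = [dictionary.doc2bow(doc) for doc in docs]

lda = models.LdaModel(bow, id2word=dictionary, num_topics=20)

# Index the corpus in topic space and query it with a new bug report.
index = similarities.MatrixSimilarity(lda[bow], num_features=lda.num_topics)
query = lda[dictionary.doc2bow(["crash", "startup"])]
scores = index[query]  # similarity of the query to every resolved report
```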

4.2 Term Frequency-Inverse Document Frequency

The main idea behind Term Frequency-Inverse Document Frequency (TF-IDF) is that the count of a term’s occurrences in documents may be used to differentiate the documents. The TF-IDF weighting scheme was adopted to represent one entity’s significance relative to the other entities in the prepared corpus. The weight of an entity increases proportionally with the number of occurrences of the word in the document.

TF is the document’s local component, measuring a normalized frequency of term occurrence, whereas the global component is represented by the inverse document frequency (IDF), i.e., \(\log [(1+n_d)/(1+df_{i,j})] + 1\), where \(n_d\) is the total number of documents and \(df_{i,j}\) is the number of documents containing the term.
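This smoothed IDF matches scikit-learn’s default, so the TF-IDF similarity score could be computed along the following lines (a sketch with a toy corpus; the paper’s exact vectorizer settings are not stated).

```python
# Sketch with scikit-learn, whose smoothed IDF equals the formula above:
# idf = log((1 + n_d) / (1 + df)) + 1 when smooth_idf=True (the default).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

resolved = ["crash when opening a project", "ui freezes on startup"]  # toy corpus
vectorizer = TfidfVectorizer(smooth_idf=True)
matrix = vectorizer.fit_transform(resolved)

# Score a new bug report against every resolved report.
query_vec = vectorizer.transform(["application crash on startup"])
scores = cosine_similarity(query_vec, matrix).ravel()
```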

4.3 Word2Vec

Word2Vec is capable of capturing the context of a word in a document, semantic and syntactic similarity, relations with other words, etc. Two variants of the Word2Vec model are available, namely continuous bag-of-words (CBOW) and skip-gram. Both capture interactions between a centred word and its context words, but in different ways.

For a predicted word vector \(\hat{r}\) and a target word vector \(w_t\), the softmax function is applied to find the probability of the target word, as given in Eq. 1.

$$\begin{aligned} P(w_t \mid \hat{r}) = \frac{\exp (w_t \cdot \hat{r})}{\sum _{w^{'} \in W} \exp (w^{'} \cdot \hat{r})} \end{aligned}$$
(1)

Here W is the set of all target word vectors, and \(\exp (w_t \cdot \hat{r})\) measures the compatibility of the target word \(w_t\) with the context \(\hat{r}\). In this paper, the gensim implementation of Word2Vec (skip-gram) is used with the word vector model pre-trained on the Google News corpus (3 billion running words, 3 million 300-dimensional English word vectors).
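A sketch of this step with gensim follows; the local file name is an assumption, and mean-pooling word vectors into a document vector is one common choice that the paper does not spell out.

```python
# Sketch: score two reports with the pre-trained Google News vectors via
# gensim. File name assumed; mean-pooling is one common document-vector choice.
import numpy as np
from gensim.models import KeyedVectors

wv = KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin", binary=True)  # 3M x 300-dim vectors

def doc_vector(tokens):
    # Average the vectors of in-vocabulary tokens; zero vector if none match.
    vecs = [wv[t] for t in tokens if t in wv]
    return np.mean(vecs, axis=0) if vecs else np.zeros(wv.vector_size)

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

score = cosine(doc_vector(["crash", "on", "startup"]),
               doc_vector(["crash", "when", "opening"]))
```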

4.4 Global Vectors for Word Representation

Global Vectors for word representation (GloVe) is an unsupervised learning algorithm that combines the features of two model families, namely global matrix factorization and local context window methods [8]. GloVe performs matrix factorization on the word-context co-occurrence matrix, reducing the error between the dot product of (any two) word embedding vectors and the log of their co-occurrence probability; in this paper, a Google News pre-trained model was used. The model can be represented as in Eq. 2, where w and \(\tilde{w}\) are word vectors.

$$\begin{aligned} F(w_i,w_j,\tilde{w}_k) = \frac{P_{ik}}{P_{jk}} \end{aligned}$$
(2)

where i, j, and k are three words and the ratio \(P_{ik}/P_{jk}\) depends upon them.
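As an illustration, pre-trained GloVe vectors can be loaded through gensim’s downloader as sketched below. The publicly distributed GloVe models are trained on Wikipedia/Gigaword or Common Crawl, so the model name here is an assumption for illustration rather than the exact pre-trained model used in the paper.

```python
# Sketch: load pre-trained GloVe vectors through gensim's downloader. The
# model name (trained on Wikipedia/Gigaword) is an assumption; it is not
# necessarily the exact pre-trained model used by the paper.
import gensim.downloader as api

glove = api.load("glove-wiki-gigaword-300")  # returns a KeyedVectors object

# Word-level and bag-of-words-level cosine similarities.
word_sim = glove.similarity("crash", "failure")
doc_sim = glove.n_similarity(["crash", "on", "startup"],
                             ["crash", "when", "opening"])
```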

4.5 Proposed Amalgamated Model

It has been observed that even established similarity recommendation models such as NextBug [15] do not produce optimal and accurate results. Therefore, the current study creates amalgamated models that merge one or more of the approaches LDA, Word2Vec, GloVe, and TF-IDF. The similarity score vectors (\(S_1\), \(S_2\), \(S_3\), \(S_4\)) for the k most similar bug reports are captured from the individual approaches, as shown in Fig. 2. Since the weights obtained from the individual methods have their own significance, a heuristic ranking method is used to combine the results into a universal ranking. The ranking approach assigns a new weight to each element of the similarity score vector of each individual approach, equal to the inverse of its position in the vector, as in Eq. 3.

$$\begin{aligned} R_i = \frac{1}{Position_i} \end{aligned}$$
(3)

Once ranks are obtained for each bug report and each selected model, the amalgamated score is generated by summing the ranks, as given in Eq. 4; the ranks are zero for models that did not return the report. This creates a vector of at most nk elements, where k is the number of duplicate bug reports returned by each model and n is the number of models being combined.

$$\begin{aligned} S = (R_1+R_2+R_3+R_4)*PC \end{aligned}$$
(4)

where S is the amalgamated score (rank) of each returned bug report and \(R_1\), \(R_2\), \(R_3\), and \(R_4\) are the ranks returned by LDA, Word2Vec, GloVe, and TF-IDF, respectively, as given in Eq. 3. Here PC is the product and component score, which works as a filter. For instance, if two bug reports belong to the same product and component, their similarity depends on the document similarity score; but if they belong to different products and components, they are unlikely to be similar even if their document similarity score is high, so the score is set to zero.
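A compact sketch of this amalgamation (Eqs. 3 and 4) is given below, assuming each model contributes an ordered top-k list of bug ids and PC is a 0/1 filter per candidate.

```python
# Sketch of Eqs. 3-4: each model contributes its ranked top-k list of bug ids;
# a report's weight from one list is 1/position (Eq. 3); weights are summed
# across models and multiplied by the 0/1 product-and-component filter (Eq. 4).
from collections import defaultdict

def amalgamate(model_rankings, pc_filter):
    # model_rankings: one ranking per model, each ordered from most to least
    # similar. pc_filter: bug id -> 1 if the candidate shares the query's
    # product and component, else 0.
    scores = defaultdict(float)
    for ranking in model_rankings:
        for position, bug_id in enumerate(ranking, start=1):
            scores[bug_id] += 1.0 / position          # Eq. 3: R = 1/position
    final = {b: s * pc_filter.get(b, 0) for b, s in scores.items()}  # Eq. 4
    return sorted(final, key=final.get, reverse=True)

# Toy example: two models, four candidate reports.
top = amalgamate([[101, 102, 103], [102, 101, 104]],
                 {101: 1, 102: 1, 103: 0, 104: 1})
# top == [101, 102, 104, 103]: 101 and 102 tie at 1.5; 103 is zeroed by PC.
```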

5 Evaluation Metrics

The evaluation metrics used to evaluate the amalgamated models are: recall-rate@k, mean average precision (MAP), and mean reciprocal rank (MRR). These metrics have been frequently used in recommendation systems for software engineering tasks [5, 6, 9, 15, 19].

5.1 Recall-Rate@k

Recall-rate is used to check the usefulness of the top-k recommendations. For a query bug q, it is defined as given in Eq. 5, following previous researchers [5, 19, 23].

$$\begin{aligned} RR(q)= {\left\{ \begin{array}{ll} 1, &{} \text {if } S(q) \cap R(q) \ne \emptyset \\ 0, &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(5)

Given a query bug q, S(q) is the ground-truth set of duplicates and R(q) is the set of top-k recommendations from the recommendation system.

5.2 Mean Average Precision (MAP)

MAP is defined as the mean of the average precision (AvgP) values over all evaluation queries: \(MAP = \sum _{q=1}^{|Q|}\frac{AvgP(q)}{|Q|}\), where |Q| is the number of queries in the test set.

5.3 Mean Reciprocal Rank (MRR)

Mean Reciprocal Rank (MRR) is calculated from the reciprocal rank values of the queries: \(MRR = \frac{1}{|Q|}\sum _{i=1}^{|Q|} ReciprocalRank(i)\), where the reciprocal rank of a query is \(ReciprocalRank(q) = \frac{1}{index_q}\), with \(index_q\) the position of the first true duplicate in the recommendation list.
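Straightforward sketches of the three metrics follow, with S(q) as the ground-truth duplicate set and R(q) as the ordered top-k recommendation list.

```python
# Sketches of the three metrics; ground_truth is S(q), recommended is R(q).

def recall_rate_at_k(ground_truth, recommended):
    # Eq. 5: 1 if at least one true duplicate appears in the top-k list.
    return int(bool(set(ground_truth) & set(recommended)))

def average_precision(ground_truth, recommended):
    # Precision is accumulated at each position that hits a true duplicate.
    hits, precision_sum = 0, 0.0
    for index, bug_id in enumerate(recommended, start=1):
        if bug_id in ground_truth:
            hits += 1
            precision_sum += hits / index
    return precision_sum / max(len(ground_truth), 1)

def mean_reciprocal_rank(queries):
    # queries: iterable of (ground_truth, recommended) pairs; the reciprocal
    # rank of a query is 1/position of the first true duplicate (0 if none).
    ranks = []
    for ground_truth, recommended in queries:
        rr = 0.0
        for index, bug_id in enumerate(recommended, start=1):
            if bug_id in ground_truth:
                rr = 1.0 / index
                break
        ranks.append(rr)
    return sum(ranks) / len(ranks)
```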

6 Results and Discussion

This section presents the results of the empirical evaluation. The experiments were run on a Google Colab machine with 24 GB of RAM and 320 GB of disk space.

The amalgamated models compare an incoming query bug report against the database of already resolved bug reports and return the top-k duplicate bugs. The algorithms were implemented in Python 3.5 using the “nltk”, “sklearn”, and “gensim” [14] packages, with the default parameter values of the algorithms. The value of k was set to 1, 5, 10, 20, 30, and 50 to investigate the effectiveness of the proposed approach.

The current study compares the proposed models with other combinations of established approaches for duplicate bug report recommendation. For empirical validation, the developed models were applied to the open bug report data (discussed in Sect. 3) consisting of three datasets of bug reports. The datasets were divided into training and test data. In the open source software (OSS) bug repository datasets, one of the columns contains the actual duplicate bug list, i.e., if a bug report actually has duplicates then the list is non-empty, otherwise it is empty (‘NA’). This list served as the ground truth for the evaluation metrics. All bug reports with a non-empty duplicate bug list were taken as the test dataset for validating the amalgamated models; the test datasets for the Apache, Eclipse, and KDE projects contained 2,518, 34,316, and 30,377 bug reports, respectively. The training dataset was used to convert the existing textual information into the vector representations of the models and was treated as the set of resolved reports. The test data was used to detect duplicate bug reports in the training dataset, which made it possible to identify the duplicate bug reports and evaluate the models.

Table 2. Mean average precision of individual and amalgamated models using all dataset.

6.1 Empirical Analysis

The empirical analysis of the proposed ensemble model has been performed on OSS datasets. The models take textual information from training dataset and create vocabulary to be used for finding the duplicates of test bug reports.

Apache Dataset. The Apache dataset is the smallest of the three datasets and contains 44,049 bug reports, generated for 35 products and 350 components. Figures 3(a) and 3(b) show that the amalgamation of models produces more effective results than the individual established approaches. Table 2 gives the MAP values for the models. From the results, it is revealed that not all combinations produce good results.

Eclipse Dataset. The Eclipse dataset contains 503,935 bug reports and 31,811 distinct ids, covering 232 products and 1,486 components. Owing to the size of the dataset, random sampling was performed to select 10% of it. The values of recall rate and MRR are presented in Figs. 3(c) and 3(d), respectively. The results reveal that the amalgamated score is better than the scores obtained from the individual approaches.

Fig. 3. Performance of (a)–(b) Apache dataset, (c)–(d) Eclipse dataset, (e)–(f) KDE dataset

KDE Dataset. This dataset contains 365,893 bug reports of 584 products, out of which 2,054 were used. Owing to the size of the dataset, random sampling was performed to select 10% of it. The evaluation metrics obtained on this dataset are depicted in Figs. 3(e) and 3(f).

6.2 Effectiveness of Amalgamation of Models

The results demonstrate the superiority of the amalgamated models in identifying duplicate reports compared to the individual approaches. Figure 3 shows the comparative performance of the proposed approach and the established approaches for varying k on all datasets. For the Apache and KDE datasets, the amalgamated model (TF-IDF + Word2Vec + LDA) produced the best results, whereas for the Eclipse dataset the amalgamated model (TF-IDF + LDA) performed better than (TF-IDF + Word2Vec + LDA). Another conclusion from the results is that Word2Vec on its own is also very powerful at detecting duplicate reports. This study proposes the amalgamated model TF-IDF + Word2Vec + LDA, which outperforms the other amalgamated models. It has also been concluded that Word2Vec and its combinations produce better results than GloVe.

6.3 Statistical Significance and Effect Size

To validate the results of the proposed model, we performed the Wilcoxon signed-rank test to compute the p-value and measured Cliff’s Delta [11] and the Spearman correlation. Table 3(a) gives the interpretation of the Cliff’s Delta measure. The Shapiro-Wilk test was used to check the normality of the results; since they turned out to be non-Gaussian, the non-parametric Spearman correlation was applied to find the relationship between the results of the different approaches.

Table 3. Statistical significance and effect size

Table 3(b) presents the p-value, Cliff’s Delta, and Spearman’s correlation coefficient of the amalgamated (TF-IDF + Word2Vec + LDA) model against TF-IDF, in terms of two metrics, for the Apache and KDE datasets, respectively. TF-IDF was chosen for comparison because it has been presented as the benchmark model in most previous studies. Table 3(b) shows that the results have a positive correlation, together with a medium or large effect size, which means that the amalgamation of models yields an improvement.
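A sketch of this statistical comparison with scipy is shown below; Cliff’s Delta has no scipy built-in, so a small implementation is included, and the per-query scores used are toy values.

```python
# Sketch: Wilcoxon signed-rank p-value and Spearman correlation via scipy,
# plus a small Cliff's Delta implementation (no scipy built-in exists).
from scipy.stats import wilcoxon, spearmanr

def cliffs_delta(xs, ys):
    # Fraction of pairs with x > y minus fraction with x < y, in [-1, 1].
    greater = sum(x > y for x in xs for y in ys)
    lesser = sum(x < y for x in xs for y in ys)
    return (greater - lesser) / (len(xs) * len(ys))

amalgamated = [0.42, 0.38, 0.51, 0.47, 0.29]  # toy per-query metric values
tf_idf = [0.30, 0.33, 0.40, 0.41, 0.25]

stat, p_value = wilcoxon(amalgamated, tf_idf)  # paired, non-parametric test
rho, _ = spearmanr(amalgamated, tf_idf)        # monotonic association
delta = cliffs_delta(amalgamated, tf_idf)      # non-parametric effect size
```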

7 Threats to Validity

Internal Validity. The dataset repository contains bug reports only up to the year 2017, and the textual information for each bug report is small in size. However, the current work applied well-established natural language processing methods to prepare the corpus from these large datasets; therefore, we believe there are no significant threats to internal validity. While using LDA, a bias may have been introduced by the choice of hyper-parameter values and the optimal number of topics. To mitigate this, the optimal number of topics was selected by following the heuristic approaches suggested by Arun et al. [2] and Cao et al. [4].

External Validity. The generalization of the results may be another limitation of this study. The similarity score was computed through a number of steps, each of which has a significant impact on the results. However, the results were verified on open source datasets to achieve sufficient generalization.

8 Conclusion

The main contribution of this paper is an attempt to amalgamate established natural language models for duplicate bug recommendation using the textual information and the non-textual information (product and component) of bug reports. The proposed amalgamated model combines the similarity scores from different models, namely LDA, TF-IDF, Word2Vec, and GloVe. The empirical evaluation was performed on open datasets from three large open source software projects, namely Apache, KDE, and Eclipse. The validation shows that for the Apache dataset the MAP value increased from 0.076 to 0.163, better than the other models, and this holds for all three datasets as shown in the experimental results. Similarly, the MRR values of the amalgamated models are also high relative to the individual models. Thus, it can be concluded that amalgamated approaches achieve better performance than individual approaches for duplicate bug recommendation. This study proposes the amalgamated model (TF-IDF + Word2Vec + LDA), which outperforms the other amalgamated models.

The future scope of the current work is to develop a Python package that allows the user to select individual models and their amalgamations on a given dataset. It would also allow the user to select combinations of textual and non-textual features from the dataset for duplicate bug detection.