1 Introduction

Studies show that software developers spend, on average, about 19% of their development time in web search, where they frequently look for relevant code snippets for their tasks (Brandt et al. 2009). Online code search engines–Open Hub, Koders, GitHub and Krugle–index thousands of large open source projects, and these projects are a potential source for such code snippets (McMillan et al. 2011). However, these traditional code search engines mostly employ keyword matching, i.e., they retrieve code snippets based on the lexical similarity between a search query and the project source code. Hence, they often do not perform well with unstructured natural language (NL) queries due to the vocabulary mismatch between NL queries and source code (Bajracharya and Lopes 2012b). That means these engines require the queries to be carefully designed by the users and to contain relevant API references. Unfortunately, preparing an effective search query that contains information on relevant APIs is not only challenging but also time-consuming for developers (Kevic and Fritz 2014a; Brandt et al. 2009). A previous study (Kevic and Fritz 2014a) also suggested that, on average, developers regardless of their experience levels perform poorly in coming up with good search terms for code search. Thus, an automated technique that complements a natural language query with a list of relevant API classes or methods (i.e., a search-engine friendly query) can greatly assist developers in performing code search. Our paper addresses this particular research problem–query reformulation with relevant API classes–by exploiting the crowdsourced knowledge stored at the Stack Overflow programming Q & A site.

Existing studies on API recommendation accept one or more natural language queries, and return relevant API classes and methods by mining feature request history and API documentation (Thung et al. 2013b), large code repositories (Zhang et al. 2017), API invocation graphs (Chan et al. 2012), library usage patterns (Thung et al. 2013a), or the code surfing behaviour of developers and API invocation chains (McMillan et al. 2011). McMillan et al. (2011) first propose Portfolio that recommends relevant API methods for a given code search query and demonstrates their usage from a large codebase. Chan et al. (2012) improve upon Portfolio by employing more sophisticated graph-mining and textual similarity techniques. Thung et al. (2013b) recommend relevant API methods to assist the implementation of an incoming feature request. Although all these techniques perform well in different working contexts, they share a set of limitations and thus fail to address the research problem of our interest. First, each of these techniques (McMillan et al. 2011; Chan et al. 2012; Thung et al. 2013b) exploits a lexical similarity measure (e.g., Dice’s coefficient) for candidate API selection. This requires that the search query be carefully prepared and contain keywords similar to the API names. In other words, the developer must already possess a certain level of experience with the target APIs to actually use these techniques (Bajracharya and Lopes 2012a). Second, API names and search queries are generally provided by different developers who may use different vocabularies to convey the same concept (Kevic and Fritz 2014b). Furnas et al. (1987) named this the vocabulary mismatch problem, and lexical similarity based techniques often suffer from it. Hence, the performance of these techniques is not only limited but also subject to the identifier naming practices adopted in the codebase under study. We thus need a technique that overcomes the above limitations and recommends relevant APIs for natural language queries drawn from a wider vocabulary.

One possible way to tackle the above challenges is to leverage the knowledge and experience of a large technical crowd on the usage of particular API classes and methods. Let us consider a natural language query–“Generating MD5 hash of a Java string.” We can mine thousands of Q & A posts from Stack Overflow that discuss relevant APIs for this task and then recommend APIs from them. For instance, the Q & A example in Fig. 1 discusses how to generate an MD5 hash (Fig. 1a), and the accepted answer (Fig. 1b) suggests that the MessageDigest API should be used for the task. Such usage of the API is also endorsed by at least 305 technical users from Stack Overflow, which validates the appropriateness of the usage. Our approach is thus generic, language independent and project insensitive, and at the same time it overcomes the vocabulary mismatch problem suffered by the past studies. One can argue in favour of Google, which is often used by developers for searching code on the web. Unfortunately, a recent study (Rahman et al. 2018) shows that developers need to spend about twice as much effort on code search as on web search when using the Google search engine. In particular, they need to reformulate their queries more frequently and more extensively for code search. Such a finding suggests that general-purpose web search engines (e.g., Google) might be calibrated for web pages only, and that they perform sub-optimally with source code, especially due to vocabulary mismatch issues (Hellendoorn and Devanbu 2017; Haiduc and Marcus 2011). Thus, automated tool support for query formulation in code search is still an open research problem that warrants further investigation.

Fig. 1 An example of (a) a Stack Overflow question and (b) its accepted answer

In this paper, we propose a query reformulation technique–RACK–that exploits the associations between query keywords and different API classes used in Stack Overflow, and translates a natural language query intended for code search into a set of relevant API classes. First, we motivate our idea of using crowdsourced knowledge for API recommendation with an exploratory study where we analyse 172,043 questions and their accepted answers from Stack Overflow. Second, we construct a keyword-API mapping database using these questions and answers where the keywords (i.e., programming requirements) are extracted from the questions and the APIs (i.e., programming solutions) are collected from the corresponding accepted answers. Third, we propose an API recommendation technique that employs three heuristics on keyword-API associations and recommends a ranked list of API classes for a given query for its reformulation. The baseline idea is to capture and learn the responses from millions of technical users (e.g., developers, researchers, programming hobbyists) to different programming problems, and then exploit them for relevant API suggestion. Our technique (1) does not rely on the lexical similarity between a query and the source code of projects for API selection, and (2) addresses the vocabulary mismatch problem by using a large vocabulary (i.e., 20K terms) produced by millions of Stack Overflow users. Thus, it has a great potential for overcoming the challenges faced by the past studies.

An exploratory study with 172,043 Java related Q & A threads (i.e., question + accepted answer) from Stack Overflow shows that (1) each answer uses at least two different API classes on average (RQ1), and (2) about 65% of the classes from each of the 11 core Java API packages are used in these answers (RQ2). Such findings clearly suggest the potential of using Stack Overflow for relevant API suggestion. Experiments using 175 code search queries randomly chosen from three Java tutorial sites–KodeJava, Java2s and Javadb–show that our technique can recommend relevant API classes with an accuracy of 83%, a mean average precision@10 of 46% and a recall@10 of 54%, which are 66%, 79% and 87% higher respectively than those of the state-of-the-art (Thung et al. 2013b) (RQ4, RQ8). Query reformulations using our suggested API classes improve 46%–64% of the baseline queries (i.e., queries containing natural language only), and their overall code retrieval accuracy improves by 19% (RQ9). Comparisons with the state-of-the-art techniques on query reformulation (Nie et al. 2016; Zhang et al. 2017) also demonstrate that RACK offers a 48% net improvement in baseline query quality as opposed to 26% by the state-of-the-art, which is 87% higher (RQ10). Our investigations with Google, Stack Overflow native search, and GitHub native code search also report that our reformulated queries can improve their results by 22%–26% in precision and 12%–28% in reciprocal rank in the context of code example search (RQ11).

Novelty in Contribution

This paper is a significantly extended version of our earlier work (Rahman et al. 2016) which employed two heuristics (KAC and KKC, Section III-B), experimented with 150 queries, and answered seven research questions. This work extends the earlier one in several aspects. First, we improve the earlier heuristics by recalibrating their weights and thresholds (i.e., RQ7). Second, we introduce a novel heuristic–Keyword Pair API Co-occurrence (KPAC, Section III-B)–that leverages word co-occurrences for candidate API selection more effectively; in fact, it performs better than the earlier two heuristics. Third, we conduct experiments with a larger dataset containing 175 distinct queries, and further evaluate them in terms of their code retrieval performance (which was missing in the earlier work). Fourth, we extend our earlier analysis and answer 11 research questions (as opposed to seven questions answered by the earlier work). Fifth, we investigate the potential application of our approach in the context of code search using popular web search engines (e.g., Google) and code search engines (e.g., GitHub).

Thus, this journal article makes the following contributions:

  • An exploratory study that suggests the potential of using Stack Overflow for relevant API suggestion against an NL query intended for code search.

  • A keyword-API mapping database that maps 655K question keywords to 551K API classes from Stack Overflow (Table 1).

  • A novel technique–RACK–that exploits query keyword-API associations stored in the crowdsourced knowledge of Stack Overflow, and reformulates a natural language query using a set of relevant API classes from Stack Overflow.

  • Comprehensive evaluation of the proposed technique with six performance metrics, and comparison with the state-of-the-art techniques, contemporary web search engines (e.g., Google, Stack Overflow native search) and code search engines (e.g., GitHub native code search).

Table 1 API packages for exploratory study

Structure of the Article

The rest of the article is organized as follows: Section 2 discusses the design and findings of our exploratory study, and Section 3 describes our proposed technique for query reformulation. Section 4 focuses on our evaluation and validation, Section 5 on threats to validity, and Section 6 on related work from the literature. Finally, Section 7 concludes the article with future research directions projected by this work.

2 Exploratory Study

Our technique relies on the mapping between natural language keywords from the questions of Stack Overflow and API classes from the corresponding accepted answers for translating a code search query into relevant API classes. Thus, an investigation is warranted into whether such answers contain any API related information and whether the questions contain any search query keywords. We perform an exploratory study using 172,043 Q & A threads from Stack Overflow, and analyse the usage and coverage of standard Java API classes in them. We also explore whether the question titles are a potential source of suitable keywords for code search. In particular, we answer three research questions as shown in Table 2.

Table 2 Research questions answered using exploratory study

2.1 Data Collection

We collect 172,043 questions and their accepted answers from Stack Overflow using the StackExchange data explorerFootnote 1 for our investigation. Since we are interested in Java APIs, we only collect questions that are annotated with the java tag. In addition, we apply several other constraints–(1) each of the questions should have at least three answers (i.e., the average answer count), with one answer being accepted as the solution, in order to ensure that the questions are answered substantially and successfully (Mamykina et al. 2011), and (2) the accepted answers should contain code-like elements such as code snippets or code tokens so that API information can be extracted from them. We identify the code elements with the help of <code> tags in the HTML source of the answers (details in Section 2.2), and use JsoupFootnote 2, a popular Java library, for HTML parsing and content extraction.
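
The second constraint can be checked directly on the answer HTML with Jsoup. The snippet below is a minimal sketch of that filter; the class name, helper method and sample HTML are illustrative and not part of the original data collection scripts.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;

    // Minimal sketch: keep only accepted answers whose HTML body
    // contains at least one <code> element.
    public class CodeAnswerFilter {

        public static boolean hasCodeElement(String answerHtml) {
            Document doc = Jsoup.parse(answerHtml);
            return !doc.select("code").isEmpty();
        }

        public static void main(String[] args) {
            String html = "<p>Use MessageDigest:</p><pre><code>MessageDigest md = "
                    + "MessageDigest.getInstance(\"MD5\");</code></pre>";
            System.out.println(hasCodeElement(html)); // true
        }
    }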

We repeat the above steps, and construct another dataset by collecting 440K Q & A threads from one of our recent works (Rahman and Roy 2018). This dataset is a superset of the above collection, and it contains more recent threads from Stack Overflow. We call it the extended dataset in the remaining sections of the exploratory study.

We collect a total of 3,809 Java API classes for our study from 19 packages of Java Standard Edition 7. While 2,912 classes are taken from 11 core Java packagesFootnote 3, the remaining classes come from 8 non-core Java packages. The goal is to find out whether these classes are referred to in Stack Overflow posts, and if so, to what extent. We first use Java ReflectionsFootnote 4, a runtime meta-data analysis library, to collect the API classes from JDK 7, and then apply regular expressions to their fully qualified names to extract the class name tokens. Table 1 shows class statistics of the 19 API packages selected for our investigation.

We also collect a set of 18,662 real-life search queries from the Google search history of the first author over the last eight years, which are analysed to answer the third research question. Although the queries come from a single user, they contain a large vocabulary of 9,029 distinct natural language search keywords, and this vocabulary was built over a long period of time. Thus, a study using these queries can still produce meaningful insights and help answer the third research question.

2.2 API Class Name Extraction

Several existing studies (Rigby and Robillard 2013; Dagenais and Robillard 2012; Bacchelli et al. 2010) extract code elements such as API packages, classes and methods from unstructured natural language texts (e.g., forum posts, mailing lists) using information retrieval (e.g., TF-IDF) and island parsing techniques. In the case of island parsing, they apply a set of regular expressions describing the Java language specification (Gosling et al. 2012), and isolate the land (i.e., code elements) from the water (i.e., free-form texts). We borrow their parsing technique (Rigby and Robillard 2013), and apply it to the extraction of API elements from Stack Overflow posts. Since we are interested in the API classes only, we adopt a selective approach for identifying them in the post contents. We first isolate the code-like sections from the HTML source of each answer from Stack Overflow using <code> tags. Then we split the sections based on white spaces and punctuation marks, and collect the tokens that follow the camel-case notation of Java classes (e.g., HashSet). According to the existing studies (Rigby and Robillard 2013; Dagenais and Robillard 2012), such parsing of code elements sometimes introduces false positives. Thus, we restrict our exploratory analysis to a closed set of 3,809 API classes from 19 Java packages (details in Table 1) to avoid false positives (e.g., camel-case tokens that are not valid API classes).
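
The sketch below illustrates this selective extraction, assuming the HTML of one accepted answer and the closed set of known API classes as inputs; the regular expression shown is a simplification of the island parser used in the study.

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Element;
    import java.util.*;
    import java.util.regex.Pattern;

    // Sketch of the island-parsing step: split <code> sections on white space
    // and punctuation, keep camel-case tokens, and restrict them to the
    // closed set of 3,809 standard API classes (Table 1).
    public class ApiClassExtractor {

        // camel-case token that starts with an upper-case letter, e.g., HashSet
        private static final Pattern CLASS_TOKEN = Pattern.compile("[A-Z][a-zA-Z0-9]*");

        public static Set<String> extract(String answerHtml, Set<String> knownApiClasses) {
            Set<String> found = new LinkedHashSet<>();
            for (Element code : Jsoup.parse(answerHtml).select("code")) {
                for (String token : code.text().split("[\\s\\p{Punct}]+")) {
                    if (CLASS_TOKEN.matcher(token).matches() && knownApiClasses.contains(token)) {
                        found.add(token);
                    }
                }
            }
            return found;
        }
    }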

2.3 Answering RQ 1: Use of APIs in the Accepted Answers of Stack Overflow

Since our API suggestion technique exploits keyword-API associations from Stack Overflow, we first investigate whether the accepted answers actually use certain API classes of interest. According to our investigation, out of 172,043 accepted answers, 136,796 (79.51%) answers refer to one or more Java API class-like tokens. About 61.02% of the answers use API classes from the 11 core Java packages whereas 9.94% of them use classes from the 8 non-core packages as a part of their solution. We analyse the HTML contents of the Stack Overflow answers with tool support and detect the occurrences of the 3,809 standard API classes (Table 1) in each of the accepted answers using a closed-world assumption (Rigby and Robillard 2013). We then examine the statistical properties and distribution of such API occurrence frequencies (i.e., total appearances, unique appearances) to answer our first research question.

Figure 2 shows (a) the probability mass function (PMF) and (b) the cumulative density function (CDF) for the total occurrences of API classes per Stack Overflow answer, where the API classes belong to the core Java API packages. Both density curves suggest that the frequency observations derive from a heavy-tailed distribution, and the majority of the density accumulates over a short frequency range. That is, most of the time only a limited number of API classes co-occur in each answer from Stack Overflow. The empirical CDF curve also closely matches the theoretical CDFFootnote 5 (i.e., red dots in Fig. 2b) of a Poisson distribution. Thus, we believe that the observations are probably taken from a Poisson distribution. We get a 95% confidence interval over [5.27, 5.37] for the mean frequency, λ = 5.32, which suggests that API classes from the core packages are referred to at least five times on average in each of the answers from Stack Overflow. We also get the 10th quantile at frequency = 2 and the 97.5th quantile at frequency = 10, which suggest that only 10% of the frequencies are below 3 and only 2.5% of the frequencies are above 10. When our investigation is repeated for non-core classes, we get a mean frequency, λ = 0.36, with a 95% confidence interval over [0.35, 0.37]. When the 11 core and 8 non-core packages are combined and employed against the extended dataset, we get a 95% confidence interval over [23.62, 23.87] for the mean frequency, λ = 23.75, with a similar distribution (i.e., Fig. 3).

Figure 4 shows the density curves of core API class occurrences per answer where only unique API classes are considered. These observations are also drawn from a heavy-tailed distribution. We get a 95% confidence interval over [2.35, 2.38] for the mean frequency, λ = 2.37, which suggests that at least two distinct classes are used on average in each answer. The 30th quantile at frequency = 1 and the 80th quantile at frequency = 4 suggest that 30% of the Stack Overflow answers refer to at most one API class whereas 20% of the answers refer to at least four distinct API classes from the core Java packages under our study. In the case of non-core classes, we get the 90th quantile at frequency = 1, which suggests that their frequencies are negligible. When the same investigation is repeated with the 19 (11 core + 8 non-core) packages against the extended dataset, we get a 95% confidence interval over [3.44, 3.46] for λ = 3.45 with a similar heavy-tailed distribution (i.e., Fig. 5).
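
For illustration, the distribution check described above can be reproduced as follows, assuming Apache Commons Math is available (the study itself used R); the per-answer frequencies in this sketch are toy values, not the study data.

    import org.apache.commons.math3.distribution.PoissonDistribution;

    // Compare the empirical CDF of per-answer API frequencies with the
    // theoretical CDF of a Poisson distribution whose mean equals the
    // observed mean (e.g., 5.32 for core classes in the study).
    public class PoissonCheck {
        public static void main(String[] args) {
            int[] apiFreqPerAnswer = {2, 4, 5, 5, 6, 7, 3, 5, 9, 4}; // toy values
            double mean = java.util.Arrays.stream(apiFreqPerAnswer).average().orElse(0);
            PoissonDistribution poisson = new PoissonDistribution(mean);
            for (int k = 0; k <= 10; k++) {
                final int bound = k;
                double empirical = java.util.Arrays.stream(apiFreqPerAnswer)
                        .filter(f -> f <= bound).count() / (double) apiFreqPerAnswer.length;
                double theoretical = poisson.cumulativeProbability(k);
                System.out.printf("k=%2d  empirical=%.2f  poisson=%.2f%n", k, empirical, theoretical);
            }
        }
    }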

Fig. 2 Frequency distribution for core API classes: (a) API frequency PMF, (b) API frequency CDF

Fig. 3 Frequency distribution for core and non-core API classes over the extended dataset: (a) API frequency PMF, (b) API frequency CDF

Fig. 4 Frequency distribution of unique API classes from core packages: (a) Distinct API frequency PMF, (b) Distinct API frequency CDF

Fig. 5 Frequency distribution of unique API classes from core and non-core packages: (a) Distinct API frequency PMF, (b) Distinct API frequency CDF

2.4 Answering RQ 2: Coverage of API Classes in the Accepted Answers from Stack Overflow Q & A Site

Since our technique exploits the inherent mapping between API classes in Stack Overflow answers and keywords from the corresponding questions for API suggestion, we need to investigate whether such answers actually use a significant portion of the API classes from the standard packages as a part of the solution. We thus identify the occurrences of the API classes from the core and non-core packages (Table 1) in Stack Overflow answers, and determine the API coverage for these packages.

Figure 6 shows the fraction of the API classes that are used in Stack Overflow answers for each of the 11 core packages under study. We note that at least 60% of the classes are used in Stack Overflow for nine out of the 11 packages. The remaining two packages–java.math and javax.swing–have 55.56% and 37.41% class coverage respectively. Among these nine packages, three large packages–java.lang, java.util and java.io–have a class coverage over 70%. Thus, on average, 65% of the classes are mentioned at least once in Stack Overflow. In Fig. 7, when our investigations are repeated using the 19 (11 core + 8 non-core) packages and the extended dataset, we get a 95% confidence interval over [56.11, 73.01] for the mean coverage, μ = 64.56%, with a normal distribution. We note that at least 40% of the classes from seven non-core packages are used in Stack Overflow. The remaining package, javax.management, has a class coverage of ≈ 20%.

Figure 8 shows the fraction of Stack Overflow answers (under study) that use API classes from each of the 11 core packages. We see that classes from the java.lang package are used in over 50% of the answers, which can be explained by the fact that the package contains a number of frequently used, basic classes such as String, Integer, Method and Exception. Two packages–java.util and java.awt–that focus on utility functions (e.g., unzip, pattern matching) and user interface controls (e.g., radio button, check box) respectively have a post coverage over 20%. We also note that classes from the java.io and javax.swing packages are used in over 10% of the Stack Overflow answers, whereas the same statistic for the remaining six packages is less than 10%. When our investigations are repeated using the 19 (11 core + 8 non-core) packages with the extended dataset, most of the above findings on core packages are reproduced, as shown in Fig. 9a. However, as in Fig. 9b, we see that API classes from all eight non-core packages except javax.xml are used in less than 5% of the Stack Overflow answers under study. Thus, although a significant fraction (e.g., 40%) of the classes from non-core packages are mentioned in Stack Overflow at least once (i.e., Fig. 7b), as a whole they are less frequently discussed than the core classes. This finding can also be explained by the highly specific functionalities (e.g., RMI, print) of the classes from the non-core packages under study.

Fig. 6 Coverage of API classes from core packages by Stack Overflow answers

Fig. 7 Coverage of API classes from (a) core and (b) non-core packages by Stack Overflow answers (extended dataset)

Fig. 8 Use of core API packages in the Stack Overflow answers

Fig. 9 Use of (a) core and (b) non-core API packages in the Stack Overflow answers (extended dataset)

2.5 Answering RQ 3: Presence of Code Search Keywords in the Title of Questions from Stack Overflow

Our technique relies on the mapping between natural language terms from Stack Overflow questions and API classes from the corresponding accepted answers for augmenting a code search query with relevant API classes. Thus, an investigation is warranted into whether keywords used for code search are present in the Stack Overflow question texts. We are particularly interested in the title of a Stack Overflow question since it summarizes the technical requirement precisely using a few important words, and also resembles a search query. We analyse the titles of 172,043 Stack Overflow questions and 18,662 real-life queries used for Google search (Section 2.1). Since we are interested in code related queries, we only select the queries that were intended for code search. Rahman et al. (2018) recently used popular tags from Stack Overflow questions to separate code related queries from non-code queries submitted to a general-purpose search engine, Google. We use a subset of their selected tags (shown in Table 3) for identifying the code related queries. We discover 3,073 such queries in our query collection (Section 2.1), and these queries contain a total of 2,001 unique search keywords.

Table 3 Keywords intended for code search

According to our analysis, 172,043 question titles contain 20,391 unique terms after performing natural language preprocessing (i.e., stop word removal, splitting and stemming). These terms match 69.22% of the keywords collected from our code search queries. Figure 10 shows the fraction of the search keywords that match with the terms from Stack Overflow questions for the past eight years starting from 2008. On average, 62.69% of the code search keywords from each year match with Stack Overflow vocabulary derived from its question titles.

Fig. 10 Coverage of keywords from the collected queries in Stack Overflow questions

Figure 11 shows (a) the probability mass function and (b) the cumulative density function of keyword frequency in the question titles. We see that the density curve shows a central tendency like a normal (i.e., bell-shaped) curve, and the empirical CDF closely matches the theoretical CDF (i.e., red curve) of a normal distribution with mean, μ = 3.22, and standard deviation, σ = 1.60. We also draw 172,043 random samples from a normal distribution with equal mean and standard deviation, and compare them with the keyword frequencies. Our Kolmogorov-Smirnov test reported a p-value of 2.2e-16<0.05 which suggests that both sample sets belong to the same distribution. Thus, we believe that the keyword frequency observations come from a normal distribution. We get a mean frequency, μ = 3.22, with a 95% confidence interval over [3.21, 3.23], which suggests that each of the question titles from Stack Overflow contains at least three code search keywords on average. Furthermore, a recent query classification model that leverages Stack Overflow tags to separate code queries from non-code queries achieves promising accuracy with 87% precision and 86% recall (Rahman et al. 2018). Such findings further suggest the potential of the Stack Overflow vocabulary for improving code search.

Fig. 11 Collected search query keywords in Stack Overflow: (a) Keyword frequency PMF, (b) Keyword frequency CDF

We also collect all the Q & A threads from two other popular tutorial sites–JavatpointFootnote 6 and TutorialspointFootnote 7–construct two baseline vocabularies from them, and then contrast them with the vocabulary of Stack Overflow. Table 4 shows the statistics on the downloaded pages and the unique terms extracted from them. For example, Tutorialspoint has a total of 2,219 web pages, and they form a vocabulary of 14,930 unique terms when both title and body of the pages are considered. It encompasses various programming domains including Java, C/C++, and C#. In contrast, when the titles of only the Java related questions of Stack Overflow are considered, they form a vocabulary of 20K terms. We also note that terms from Tutorialspoint page titles match only ≈20% of the code search keywords, whereas the matching ratio is 69% for Stack Overflow, which is 237% higher. Surprisingly, when analysed from a granular perspective, Stack Overflow might not be better than these two sites. For example, titles from Javatpoint and Tutorialspoint provide 15.91% and 9.08% of the search keywords as opposed to <1.00% by Stack Overflow when 1,000 random pages are analysed. However, Stack Overflow offers (1) a nice combination of query terms (in the questions) and API classes (in the code snippets), and (2) a much larger collection of Q & A threads than Javatpoint and Tutorialspoint across various domains. Thus, it has a higher potential for assisting developers in traditional code search.

Table 4 Code search keywords found in tutorial sites

3 RACK: Automatic Reformulation of Code Search Query Using Crowdsourced Knowledge

According to the exploratory study (Section 2), at least two API classes are used in each of the accepted answers of Stack Overflow, and about 65% of the API classes from the core packages are used in these answers. Besides, the titles of Stack Overflow questions are a major source of query keywords for code search. Such findings suggest that Stack Overflow might be a potential source not only for code search keywords but also for API classes relevant to them. Since we are interested in exploiting this keyword-API association from Stack Overflow questions and answers for API suggestion (i.e., for query reformulation), we need a technique that stores such associations, mines them automatically, and then recommends the most relevant APIs. Thus, our proposed technique has two major steps–(a) construction of a token-API mapping database, and (b) recommendation of relevant API classes for a code search query written in natural language (a.k.a. an NL query). Figure 12 shows the schematic diagram of our proposed technique–RACK–for API recommendation targeting query reformulation.

Fig. 12 Proposed technique for API recommendation: (a) Construction of token-API mapping database, (b) Translation of a code search query into relevant API classes

3.1 Construction of NL Token-API Mapping Database

Since our technique relies on keyword-API associations from Stack Overflow, we need to extract and store such associations for quick access. In Stack Overflow, each question describes a technical requirement such as “how to send an email in Java?” The corresponding answer offers a solution containing code example(s) that refer to one or more API classes (e.g., MimeMessage, Transport). We capture both the requirement and the API classes carefully, and exploit their semantic association for the development of the token-API mapping database. Since the title summarizes a question using a few important words, we only use the titles of the questions. Acceptance of an answer by the person who posted the question indicates that the answer actually meets the requirement in the question. Thus, we consider only the accepted answers from the answer collection for our analysis. The construction of the mapping database has several steps as follows:

Token Extraction from Titles

We collect the title from each of the questions, and apply standard natural language pre-processing steps such as stop word removal, splitting and stemming to them (Step 1, Fig. 12a). Stop words are frequently used words (e.g., the, and, some) that carry very little meaning in a sentence. We use a stop word listFootnote 8 hosted by Google for the stop word removal step. The splitting step splits each word containing any punctuation mark (e.g., ‘.’, ‘?’, ‘!’, ‘;’), and transforms it into a list of words. The stemming step then extracts the root of each of the words (e.g., “send” from “sending”) in the list, for which the Snowball stemmer (Ponzanelli et al. 2014; Vassallo et al. 2014) is used. Thus, we extract a set of unique, stemmed words that collectively convey the meaning of the question title, and we consider them as the “tokens” of the title of a question from Stack Overflow. Our database ends up with a total of 19,783 unique NL terms.
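
A simplified sketch of this preprocessing pipeline is shown below. The stop word list is a small excerpt and stem() is a crude placeholder for the Snowball stemmer used in the study.

    import java.util.*;

    public class TitlePreprocessor {

        // small excerpt of a stop word list; the study uses Google's list
        private static final Set<String> STOP_WORDS =
                new HashSet<>(Arrays.asList("how", "to", "in", "a", "an", "the", "and", "some"));

        public static Set<String> extractTokens(String title) {
            Set<String> tokens = new LinkedHashSet<>();
            // split on white space and punctuation marks
            for (String word : title.toLowerCase().split("[\\s\\p{Punct}]+")) {
                if (word.isEmpty() || STOP_WORDS.contains(word)) continue;  // stop word removal
                tokens.add(stem(word));                                     // stemming
            }
            return tokens;
        }

        // placeholder for a Snowball stemmer call (e.g., "sending" -> "send")
        private static String stem(String word) {
            return word.endsWith("ing") ? word.substring(0, word.length() - 3) : word;
        }

        public static void main(String[] args) {
            System.out.println(extractTokens("Generating MD5 hash in Java"));
            // e.g., [generat, md5, hash, java]
        }
    }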

API Class Extraction

We collect the accepted answer for each of our selected questions, and parse its HTML source using the Jsoup parserFootnote 9 for code segments (Steps 2, 3, Fig. 12a). We extract all <code> and <pre> tags from the source content as they generally contain code segments (Rahman et al. 2014). It should be noted that code segments may sometimes be demarcated by other tags or by no tag at all. However, identification of such code segments is challenging and often prone to false positives. Thus, we restrict our analysis to the contents inside <code> and <pre> tags for code segment collection from Stack Overflow. We split each of the segments based on punctuation marks and white spaces, and discard the programming keywords. Existing studies (Bacchelli et al. 2010; Rigby and Robillard 2013) apply island parsing for API method or class extraction using a set of regular expressions. Similarly, we use a regular expression for Java class names (Gosling et al. 2012), and extract the API class tokens that follow the camel-case notation. Thus, we collect a set of unique API classes from each of the accepted answers. The API classes (e.g., String, Integer, Double) from the java.lang package are mostly generic and frequently used in code, which is also supported by our RQ2. Hence, we discard all API classes from this package during our API extraction from Stack Overflow answers.

Token-API Linking

Natural language tokens from a question title hint at the technical requirement described in the question, and API names from the accepted answer represent the relevant APIs that can meet that requirement. Thus, the programming Q & A site–Stack Overflow–inherently provides an important semantic association between a list of tokens and a list of APIs. For instance, our technique generates a list of natural language tokens–{generat, md5, hash}–and an API token–MessageDigest–from the showcase example on MD5 hash (Fig. 1). We capture such associations from 126,567 Stack Overflow question and accepted answer pairs, and store them in a relational database (Steps 4, 5, Fig. 12a) for relevant API recommendation for any code search query.
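
The following sketch outlines the linking step, assuming the tokens and API classes of one Q & A thread are already extracted. A nested in-memory map stands in for the relational database used in practice.

    import java.util.*;

    public class TokenApiLinker {

        // token -> (API class -> co-occurrence frequency)
        private final Map<String, Map<String, Integer>> tokenToApi = new HashMap<>();

        public void link(Set<String> titleTokens, Set<String> answerApiClasses) {
            for (String token : titleTokens) {
                Map<String, Integer> counts =
                        tokenToApi.computeIfAbsent(token, t -> new HashMap<>());
                for (String api : answerApiClasses) {
                    counts.merge(api, 1, Integer::sum);
                }
            }
        }

        public Map<String, Integer> apisFor(String token) {
            return tokenToApi.getOrDefault(token, Collections.emptyMap());
        }
    }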

3.2 API Relevance Ranking & Reformulation of the NL-Query

In the token-API mapping database, each NL token (or term) associates with different APIs, and each API class associates with a number of NL tokens. Thus, we need a technique that carefully analyses such associations, identifies the candidate APIs, and then recommends the most relevant ones for a given query. It should be noted that we do not apply traditional association rule mining (Xie and Pei 2006). Our investigations using the constructed database (Section 3.1) show that the co-occurrence frequencies between NL terms and API classes in Stack Overflow posts are not sufficient to form association rules for all queries. The API class ranking and recommendation targeting our query reformulation for code search involve several steps as follows:

3.2.1 Identification of Keyword Context

In natural language processing, the context of a word refers to the list of other words that co-occur with that word in the same phrase, same sentence or even the same paragraph (Harris 1968). Co-occurring words complement the semantics of one another (Mihalcea and Tarau 2004). Yuan et al. (2014) analyse programming posts and tags from the Stack Overflow Q & A site, and use word context for determining semantic similarity between any two software-specific words. In this research, we identify the words that co-occur with each query keyword in thousands of question titles from Stack Overflow. For each keyword, we refer to these co-occurring words as its context. We then opportunistically use these contextual words for estimating semantic relevance between any two keywords.
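
A minimal sketch of this context construction is shown below, assuming the question titles have already been preprocessed into token sets as described in Section 3.1.

    import java.util.*;

    // The context of a keyword is the set of other words that co-occur with it
    // in the (already preprocessed) Stack Overflow question titles.
    public class ContextBuilder {

        public static Map<String, Set<String>> buildContexts(List<Set<String>> tokenizedTitles) {
            Map<String, Set<String>> context = new HashMap<>();
            for (Set<String> tokens : tokenizedTitles) {
                for (String token : tokens) {
                    Set<String> coWords = context.computeIfAbsent(token, t -> new HashSet<>());
                    for (String other : tokens) {
                        if (!other.equals(token)) coWords.add(other);
                    }
                }
            }
            return context;
        }
    }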

3.2.2 Candidate API Selection

In order to collect candidate APIs for an NL query, we employ three different heuristics. These heuristics consider not only the association between query keywords and APIs but also the coherence among the APIs themselves. Thus, the key idea is to identify as candidates those APIs that are not only likely for the query keywords but also functionally consistent with one another.

Keyword-API Co-occurrence (KAC)

Stack Overflow discusses thousands of programming problems, and these discussions contain both natural language texts (i.e., keywords) and references to a number of APIs. According to our observation, several keywords might co-occur with a particular API and a particular keyword might co-occur with several APIs across different programming solutions. This co-occurrence generally takes place either by chance or due to semantic relevance. Thus, if carefully analysed, such co-occurrences could be a potential source for semantic association between keywords and APIs. We capture these co-occurrences (i.e., associations) between keywords from question titles and APIs from accepted answers, discard the random associations using a heuristic threshold (δ), and then collect the top API classes (LKAC[Ki]) for each keyword (Ki), i.e., those that co-occur most frequently with the keyword on Stack Overflow.

$$L_{KAC}[K_{i}]=\{A_{j}\mid A_{j}\in A ~\wedge~ rank_{freq}(K_{i}\rightarrow A_{j})\le \delta\} $$

Here, \(K_{i}\rightarrow A_{j}\) denotes the association between a keyword Ki and an API class Aj, rankfreq returns rank of the association from the ranked list based on association frequency, and δ is a heuristic rank threshold. In our research, we consider top ten (i.e., δ = 10) APIs as candidates for each keyword, which is carefully chosen based on iterative experiments on our dataset (see RQ7 for details).
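
A minimal sketch of this selection step is shown below. The frequency map stands for one keyword's row in the token-API mapping database, and the example frequencies are invented for illustration.

    import java.util.*;
    import java.util.stream.Collectors;

    // KAC: for a keyword, rank the co-occurring API classes by association
    // frequency and keep the top-delta (delta = 10 in RACK).
    public class KacHeuristic {

        public static List<String> topApis(Map<String, Integer> apiFrequencies, int delta) {
            return apiFrequencies.entrySet().stream()
                    .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                    .limit(delta)
                    .map(Map.Entry::getKey)
                    .collect(Collectors.toList());
        }

        public static void main(String[] args) {
            Map<String, Integer> freq = new HashMap<>();
            freq.put("MessageDigest", 305); // invented frequencies
            freq.put("BigInteger", 120);
            freq.put("String", 90);
            System.out.println(topApis(freq, 2)); // [MessageDigest, BigInteger]
        }
    }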

Keyword Pair–API Co-occurrence (KPAC)

While frequent co-occurrence of APIs with a query keyword is a good indication of their relevance to the query, it might also fall short because the query may contain more than one keyword. That is, API classes relevant to (i.e., frequently co-occurring with) one keyword might not be relevant to the other keywords of the query. Thus, API classes that are simultaneously relevant to multiple keywords should be selected as candidates. We consider the nC2 keyword pairs formed from the n keywords of a query, and identify the APIs that frequently co-occur with both keywords of a pair in the same context (e.g., the same Q & A thread). Suppose Ki and Kj are two keywords forming one of the nC2 keyword pairs of the query. The candidate API classes (LKPAC[Ki,Kj]) are considered relevant if they occur in an accepted answer of Stack Overflow while both keywords appear in the corresponding question title. We select such relevant candidates as follows:

$$L_{KPAC}[K_{i}, K_{j}]=\{A_{m}\mid A_{m}\in A ~\wedge~ rank_{freq}((K_{i}, K_{j})\rightarrow A_{m})\le \delta\} $$

Here, \((K_{i}, K_{j})\rightarrow A_{m}\) denotes the association between a keyword pair (Ki,Kj) from a question title and an API class Am from the corresponding accepted answer of Stack Overflow. We capture the top ten (i.e., δ = 10) such co-occurrences for the KPAC heuristic, and the detailed justification for this choice can be found in RQ7. We determine the association based on co-occurrences in the same set of documents; in this case, each question-answer thread from Stack Overflow is considered as a document. While co-occurrences of keyword triples with APIs could also be considered for API candidacy, existing IR-based studies report that two-word phrases are more effective as a semantic unit (e.g., “chat room”) than triples (e.g., “find chat room”) (Mihalcea and Tarau 2004; Rahman and Roy 2017).

Keyword–Keyword Coherence (KKC)

The two heuristics above determine API candidacy based on the co-occurrence between query keywords and APIs in the same document. That is, multiple keywords from the query are also required to co-occur in the same document. However, such co-occurrences might not always happen, and yet the keywords could be semantically related to one another (i.e., they co-occur in the query). More importantly, the candidate APIs should still be relevant to multiple keywords even when those keywords do not co-occur. Yuan et al. (2014) determine the semantic similarity between any two software-specific words by using their contexts from Stack Overflow questions and answers. We adapt their technique for identifying coherent keyword pairs that might not co-occur. The goal is to collect candidate APIs relevant to these pairs based on their coherence. We (1) develop a context (Ci) for each of the n query keywords by collecting its co-occurring words from thousands of question titles from Stack Overflow, (2) determine the semantic similarity of each of the nC2 keyword pairs based on their contexts derived from Stack Overflow, and (3) use these measures to identify the coherent pairs and then to collect the functionally coherent APIs for them. At the end of this step, we have a set of candidate APIs for each of the coherent keyword pairs.

Suppose, two query keywords Ki and Kj have context word list Ci and Cj respectively. Now, the candidate APIs (LKKC) that are relevant to both keywords and functionally consistent with one another can be selected as follows:

$$L_{KKC}[K_{i},K_{j}]=\{L[K_{i}]\cap L[K_{j}] \mid \cos(C_{i},C_{j})>\gamma\} $$

Here, \(\cos(C_{i},C_{j})\) denotes the cosine similarity (Rahman et al. 2014) between the two context lists–Ci and Cj–and γ is the similarity threshold. We consider γ = 0 in this work based on iterative experiments on our dataset (see RQ7 for the detailed justification). L[Ki] and L[Kj] are the top frequent APIs for the two keywords–Ki and Kj–where Ki and Kj might not co-occur in the same question title. Thus, LKKC[Ki,Kj] contains APIs that are relevant to both keywords (i.e., co-occur with them in Stack Overflow answers) and functionally consistent with one another. Since the candidate APIs co-occur with the keywords of each coherent pair (i.e., \(\cos(C_{i},C_{j})>\gamma\)) in different contexts, they are also likely to be coherent for the programming task at hand. Such coherence can often be explained in terms of the dependencies among the API classes.
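
The sketch below illustrates the coherence check with a simplified, binary-vector version of the cosine similarity; the actual implementation may weight context terms differently, and the toy contexts are invented.

    import java.util.*;

    public class ContextSimilarity {

        // cosine similarity over binary term vectors built from two context sets
        public static double cosine(Set<String> contextA, Set<String> contextB) {
            if (contextA.isEmpty() || contextB.isEmpty()) return 0.0;
            Set<String> common = new HashSet<>(contextA);
            common.retainAll(contextB);
            return common.size() / (Math.sqrt(contextA.size()) * Math.sqrt(contextB.size()));
        }

        public static void main(String[] args) {
            Set<String> html = new HashSet<>(Arrays.asList("parse", "tag", "jsoup", "page"));
            Set<String> parser = new HashSet<>(Arrays.asList("parse", "tag", "xml", "jsoup"));
            System.out.println(cosine(html, parser)); // 0.75 for these toy contexts
        }
    }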

Algorithm 1 Relevant API class ranking in RACK (pseudo code)

3.2.3 API Relevance Ranking Algorithm

Figure 12b shows the schematic diagram, and Algorithm 1 shows the pseudo code of our API relevance ranking algorithm–RACK. Once a search query is submitted, we (1) perform Part-of-Speech (POS) tagging on the query to extract the meaningful words such as nouns and verbs (Wong et al. 2013; Capobianco et al. 2013), and (2) apply standard natural language preprocessing (i.e., stop word removal, splitting, and stemming) to them to extract the stemmed words (Lines 3–4, Algorithm 1). For example, the query–“html parser in Java”–turns into three keywords–‘html’, ‘parser’ and ‘java’–at the end of the above step. We then apply our three heuristics–KAC, KPAC and KKC–to the stemmed keywords, and collect candidate APIs from the token-API mapping database (Step 2, Fig. 12b, Lines 5–8, Algorithm 1). The candidate APIs are selected based not only on their co-occurrence with the query keywords but also on the coherence (i.e., functional consistency) among themselves. We then use the following metrics (i.e., derived from the above heuristics) to estimate the relevance of the candidate API classes for the query.

API co-occurrence likelihood

estimates the probability of co-occurrence of a candidate API (Aj) with one (Ki) or more (Ki,Kj) keywords from the search query. It considers the rank of the API in the ranked list based on keyword-API co-occurrence frequency (i.e., KAC and KPAC) and the size of the list, and then provides a normalized score (on the scale from 0 to 1) as follows:

$$S_{KAC}(A_{j}, K_{i})= 1-\frac{rank(A_{j}, sortByFreq(L_{KAC}[K_{i}]))}{|L_{KAC}[K_{i}]|} $$
$$S_{KPAC}(A_{j}, K_{i}, K_{j})= 1-\frac{rank(A_{j}, sortByFreq(L_{KPAC}[K_{i},K_{j}]))}{|L_{KPAC}[K_{i},K_{j}]|} $$

Here, SKAC and SKPAC denote the API co-occurrence likelihood estimates, and they range from 0 (i.e., not likely at all for the keywords) to 1 (i.e., highly likely for the keywords). The more likely an API is for the keywords, the more relevant it is for the query. This approach might also encourage common API classes (e.g., List, String) that are used with most programming tasks but might not be helpful for relevant code snippet search. We thus apply appropriate filters and thresholds to avoid such noise.

API coherence

estimates the coherence of an API (Aj) with other candidate APIs for a query. Since the query targets a particular programming task (e.g., “parsing the HTML source”), the suggested APIs should be logically consistent with one another. One way to heuristically determine such coherence is to exploit the semantic relevance among the corresponding keywords that co-occurred with that API (Aj). The underlying idea is that if two keywords are semantically similar, their co-occurred API sets could also be logically consistent with each other. We thus determine semantic similarity between any two keywords (Ki,Kj) from the query using their context lists (Ci,Cj) (Yuan et al. 2014), and then propagate that measure to each of their candidate API classes (Aj) that co-occurred with both of the keywords (i.e., KKC) as follows:

$$S_{KKC}(A_{j},K_{i},K_{j})= \cos(C_{i},C_{j}) \mid (K_{i}\rightarrow A_{j})\wedge (K_{j}\rightarrow A_{j}) $$

Here, SKKC denotes the API Coherence estimate, and it ranges from 0 (i.e., not relevant at all with multiple keywords) to 1 (i.e., highly relevant). It should be noted that each candidate, Aj, comes from L[Ki] or L[Kj], i.e., the API is already relevant to each of Ki and Kj in their corresponding contexts. SKKC investigates how similar those contexts are, and thus heuristically estimates the coherence between the APIs from these contexts.

We first estimate the API Co-occurrence Likelihood of each of the candidate APIs, which suggests the likelihood of the API for one or more keywords from the given query (Lines 9–22, Algorithm 1). Then we determine the API Coherence of each candidate API, which suggests the coherence of the API with the other candidate APIs for the query (Lines 23–30). Once all metrics of each candidate are calculated (Step 3, Fig. 12b), only the maximum score is taken into consideration, where appropriate weights–α, β and (1 − α − β)–are applied (Lines 31–34, Algorithm 1). These weights control how our two dimensions–co-occurrence and coherence–affect the final relevance ranking of the candidates. We consider a heuristic value of 0.325 for α and a value of 0.575 for β, and the detailed weight selection method is discussed in Section 4.9. The candidates are then ranked based on their final scores, and the Top-K API classes from the ranked list are returned as the API recommendation (Lines 35–36, Algorithm 1, Steps 4, 5, Fig. 12b). These API classes are then used for NL-query reformulation.

Working Example

Table 5 shows a working example of how our proposed query reformulation technique–RACK–works. Here we reformulate our natural language query–“HTML parser in Java”–into relevant API classes. We first apply the KAC heuristic, and collect the Top-5 (i.e., δ = 5 for this example) candidate APIs for each of the three keywords–‘html’, ‘parser’ and ‘java’–based on the co-occurrence frequencies of the candidates with the keywords. We repeat the same step for each of the three (i.e., 3C2) keyword pairs–(html, parser), (html, java) and (parser, java)–by applying our KPAC heuristic. Then we estimate the co-occurrence likelihood (with the keywords and keyword pairs) of each of the candidate APIs. For example, Document has the maximum likelihood of 1.00 among the candidates, not only for the single search keywords but also for the keyword pairs. We then determine the coherence of each candidate API (with the other candidates) based on the semantic relevance among the above three keyword pairs. For example, ‘html’ and ‘parser’ have a semantic relevance of 0.42 (on the scale from 0 to 1) based on their contexts from Stack Overflow questions and answers, and they have several common candidates such as Document, Element and File. Since the two keywords are semantically relevant, their relevance score (i.e., 0.42, SKKC) is propagated to their shared candidate APIs as a proxy for the coherence among the candidates. We then gather all scores for each candidate, choose the best score, and finally get a ranked list. From the recommended list, we see that Document, Element and Jsoup are highly relevant APIs from the Jsoup library for the given NL query. Our technique returns such a list of relevant API classes as the reformulation of an original NL query.

Table 5 An example of query reformulation using RACK

4 Experiment

One of the most effective ways to evaluate a technique that suggests relevant API classes or methods for a query is to check their conformance with the gold set APIs of the query. Since the suggested APIs can be used to reformulate the initial (i.e., natural language) query, the quality of the automatically reformulated query is another performance indicator for the technique. We evaluate our technique using 175 code search queries, their gold set APIs, and their relevant code segments (i.e., implementing the tasks in the queries) collected from three programming tutorial sites. We determine the performance of our technique using six appropriate metrics from the literature. Then we compare with two variants of the state-of-the-art technique on API recommendation (Thung et al. 2013b) and a popular code search engine–Lucene (Haiduc et al. 2013)–to validate our performance. We answer eight research questions with our experiments, as shown in Table 6.

Table 6 Research questions answered using experiment

4.1 Experimental Dataset

Data Collection

We collect 175 code search queries for our experiment from three Java tutorial sites– KodeJavaFootnote 10, JavaDBFootnote 11 and Java2sFootnote 12. These sites discuss hundreds of programming tasks that involve the usage of different API classes from the standard Java API libraries. Each of these task descriptions generally has three parts–(1) a title (i.e., question) for the task, (2) one or more code snippets (i.e., answer), and (3) an associated prose explaining the code. The title summarizes a programming task (e.g., “How do I decompress a GZip file in Java?”) using natural language texts. It generally uses a few pertinent keywords (e.g., “decompress”, “GZip”), and also often resembles a query for code search (Section 2.5). We thus consider such titles from the tutorial sites as the code search queries, and use them for our experiment in this research.

Gold Set Development

The prose explaining the code often refers to one or more APIs (e.g., GZipOutputStream, FileOutputStream) from the code snippet(s) that are found to be essential for the task. In other words, such APIs can be considered as the most relevant ones (i.e., vital) for the target programming task. We collect such APIs from the prose against each of the task titles (i.e., code search queries) from our dataset, and develop a gold set–API-goldset–for the experiment. Since relevance of the APIs is determined based on working code examples and their associated prose from the publicly available and popular tutorial sites, the subjectivity associated with the relevance of the collected APIs is minimized (Chan et al. 2012). We also collect the code segments verbatim that implement each of the selected tasks (i.e., our queries) from these tutorial sites, and develop another gold set–Code-goldset–for our experiments. Our goals are to (1) compare our queries containing the suggested API classes with the baseline queries containing only NL keywords and (2) compare our queries with the reformulated queries by the state-of-the-art techniques on API recommendation (Thung et al. 2013b; Nie et al. 2016; Zhang et al. 2017).

Corpus Preparation

We evaluate not only the API recommendation performance of RACK but also the retrieval performance of its reformulated queries. We collect relevant code snippets (i.e., the ground truth) for each of our 175 search queries from the above tutorial sites, and develop a corpus. It should be noted that each query-code snippet pair comes from the same Q & A thread of the tutorial sites. However, this approach leaves us with a corpus of 175 documents, which does not represent a real-world corpus. We thus extend our code corpus by adding more code snippets from one of our earlier works (Rahman and Roy 2014), which provides a corpus containing 4,170 (175 + 3,995) code snippets. It should be noted that the additional 3,995 code snippets were carefully collected from hundreds of open source projects hosted at GitHub (see Rahman and Roy (2014) for details). This corpus is referred to as the 4K-Corpus throughout the later sections of the paper.

We also develop two other corpora containing 256,754 (175 + 256,399) and 769,244 (175 + 769,069) documents respectively. They are referred to as 256K-Corpus and 769K-Corpus in the rest of the sections. These corpus documents are Java classes extracted from an internet-scale and well-established dataset– IJaDataset (Lopes et al. 2010; Keivanloo et al. 2014; Svajlenko et al. 2014). The dataset was constructed using 24,666 real world Java projects across various domains, and they were collected from SourceForgeFootnote 13 and Google CodeFootnote 14 repositories. We analyse 1,500,000 Java source files from the dataset, and discard the ones with a size greater than 3KB. 95% of our ground truth code segments have a size less than 3KB. The goal was to avoid the large and potentially noisy code snippets in the corpus. Given the large size (i.e., 769K documents) and cross-domain nature of the collected projects, our corpora are thus likely to represent a real world code search scenario.

We consider each code snippet from all three corpora as an individual document, apply standard natural language preprocessing (i.e., token splitting, stop word removal, programming keyword removal) to it, and then index the corpus documents using Apache LuceneFootnote 15, a search engine widely used in the relevant literature (Haiduc et al. 2013; Kevic and Fritz 2014a; Moreno et al. 2014). The indexed corpus is then used to determine the retrieval performance of the initial and reformulated queries for code search.
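
The indexing step can be sketched with the Lucene API as follows (shown for a Lucene 5/6-style API; the field name, analyzer choice and index path are illustrative rather than the exact configuration used in the study).

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import java.nio.file.Paths;

    // Each preprocessed code snippet becomes one indexed document.
    public class CorpusIndexer {
        public static void main(String[] args) throws Exception {
            Directory dir = FSDirectory.open(Paths.get("corpus-index"));
            IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
            try (IndexWriter writer = new IndexWriter(dir, config)) {
                Document doc = new Document();
                doc.add(new TextField("code",
                        "MessageDigest md = MessageDigest.getInstance(\"MD5\")",
                        Field.Store.YES));
                writer.addDocument(doc);
            }
        }
    }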

Replication

All the experimental data, associated tools and implementations are hosted onlineFootnote 16 for replication or third party reuse.

4.2 Performance Metrics

We choose six performance metrics for the evaluation and validation of our technique that are widely adopted in the relevant literature (McMillan et al. 2011; Chan et al. 2012; Thung et al. 2013b). Two of them are related to recommendation systems whereas the other four are widely popular in the information retrieval domain.

Top-K Accuracy/Hit@K

It refers to the percentage of the search queries for each of which at least one item (e.g., API class, code segment) is correctly returned within the Top-K results by a recommendation technique. It is also called Hit@K (Wang and Lo 2014). Top-K Accuracy of a technique can be defined as follows:

$$Top\textendash K~Accuracy(Q) = \frac{{\sum}_{q\in Q} isCorrect(q, K)}{|Q|} \times 100\% $$

Here, isCorrect(q, K) returns a value 1 if there exists at least one correct API class (i.e., from the API-goldset) or one correct code segment (i.e., implements the task in query) in the Top-K returned results, and returns 0 otherwise. Q denotes the set of all search queries used in the experiment. Although Top-K Accuracy and Hit@K are used interchangeably in the literature (Thongtanunam et al. 2015; Wang and Lo 2014), we use Hit@K to denote recommendation accuracy in the remaining sections for the sake of consistency.

Mean Reciprocal Rank@K (MRR@K)

Reciprocal rank@K refers to the multiplicative inverse of the rank (i.e., \(1/rank(q, K)\), \(q\in Q\)) of the first relevant API class or code segment in the Top-K results returned by a technique. Mean Reciprocal Rank@K (MRR@K) averages such measures for all search queries (\(\forall q\in Q\)) in the dataset. It can be defined as follows:

$$MRR@K(Q) =\frac{1}{|Q|}{\sum}_{q\in Q}{\frac{1}{rank(q, K)}} $$

Here, rank(q, K) returns the rank of the first correct API or the correct code segment from a ranked list of size K. If no correct API class or code segment is found within the Top-K positions, then rank(q, K) returns \(\infty \). On the contrary, it returns 1 for the correct result at the topmost position of a ranked list. Thus, MRR can take a maximum value of 1 and a minimum value of 0. The bigger the MRR value is, the better the technique is.

Mean Average Precision@K (MAP@K)

Precision@K calculates the precision at the occurrence of every single relevant item (e.g., API class, code segment) in the ranked list. Average Precision@K (AP@K) averages the precision@K for all relevant items within Top-K results for a code search query. Mean Average Precision@K is the mean of Average Precision@K for all queries (Q) from the dataset. MAP@K of a technique can be defined as follows:

$$ AP@K=\frac{{\sum}_{k = 1}^{K}P_{k}\times rel_{k}}{\left|RR \right|} $$
$$MAP@K=\frac{{\sum}_{q\epsilon Q}AP@K(q)}{\left|Q\right|} $$

Here, relk denotes the relevance function of the kth result in the ranked list, which returns either 1 (i.e., relevant) or 0 (i.e., non-relevant), Pk denotes the precision at the kth result, and K refers to the number of top results considered. RR is the set of relevant results for a query, and Q is the set of all queries.

Mean Recall@K (MR@K)

Recall@K refers to the percentage of gold set items (e.g., API, code segment) that are correctly recommended for a code search query in the Top-K results by a technique. Mean Recall@K (MR@K) averages such measures for all queries (Q) in the dataset. It can be defined as follows:

$$MR@K(Q) =\frac{1}{|Q|}\sum\limits_{q\in Q}{\frac{|result(q, K)\cap gold(q)|}{|gold(q)|}} $$

Here, result(q, K) refers to the Top-K APIs recommended by a technique, and gold(q) refers to the goldset APIs for each query q ∈ Q. A higher MR@K indicates a better recommendation technique.
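A compact sketch of MR@K under the same illustrative conventions as the earlier metric snippets.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class MeanRecallAtK {
    // MR@K: average fraction of gold-set items found within the Top-K results.
    static double mrAtK(Map<String, List<String>> results, Map<String, Set<String>> gold, int k) {
        double sum = 0.0;
        for (Map.Entry<String, Set<String>> e : gold.entrySet()) {
            List<String> ranked = results.getOrDefault(e.getKey(), Collections.emptyList());
            Set<String> retrieved = new HashSet<>(ranked.subList(0, Math.min(k, ranked.size())));
            retrieved.retainAll(e.getValue());                    // result(q, K) ∩ gold(q)
            sum += (double) retrieved.size() / e.getValue().size();
        }
        return sum / gold.size();
    }
}
```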

Query Effectiveness (QE)

It refers to the rank of the first relevant document in the result list retrieved by a query. The metric approximates a developer's effort in locating the first item of interest. Thus, the lower the effectiveness measure is, the more effective the query is (Rahman and Roy 2017; Moreno et al. 2015). We use this measure to determine whether a given query improves after its reformulation.

Normalized Discounted Cumulative Gain (NDCG)

It determines the quality of ranking provided by a technique. With a graded relevance scale for results, the metric accumulates overall gain or usefulness from the top to the bottom of the list (Järvelin and Kekäläinen 2002; Wang et al. 2013). It assumes that (1) highly relevant results are more useful when they appear earlier in the ranked list, and (2) highly relevant results are more useful than marginally relevant results. Thus, Discounted Cumulative Gain (DCG) of a ranked list returned by a query q can be calculated as follows:

$$DCG(q)=\sum\limits_{k = 1}^{K}\frac{grel_{k}}{\log_{2}(k + 1)}\text{~~where~~} grel_{k}= 1-\frac{goldRank(k,gold(q))}{|gold(q)|} $$

Here, grel_k refers to the graded relevance of the result at position k. goldRank(.) returns the rank of the k-th result within the goldset items gold(q). If the k-th result is not found in the goldset, grel_k is simply set to 0 as a special case. Thus, grel_k provides a graded relevance scale between 0 and 1 for each relevant result. Once DCG(q) is calculated, the normalized DCG can be calculated as follows:

$$NDCG(q)=\frac{DCG(q)}{IDCG(q)},~~ NDCG(Q)=\frac{1}{|Q|}\sum\limits_{q\in Q}NDCG(q) $$

Here, IDCG(q) is the Ideal Discounted Cumulative Gain, which is derived from the ideal ranking of the goldset items. Thus, NDCG(q) is the metric for one single query q, whereas NDCG(Q) averages the metric over all queries (∀q ∈ Q). We use NDCG to determine the quality of code search ranking from the traditional web/code search engines (Section 4.13).
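The NDCG computation can be sketched as below; the 1-indexed goldRank convention and the zero-IDCG guard are our assumptions where the definition above leaves room for interpretation.

```java
import java.util.List;

class NdcgAtK {
    // Graded relevance from the formula above: grel_k = 1 - goldRank(k)/|gold(q)|,
    // or 0 when the k-th result is not a gold-set item (goldRank assumed 1-indexed).
    static double dcg(List<String> ranked, List<String> gold, int k) {
        double dcg = 0.0;
        for (int i = 0; i < Math.min(k, ranked.size()); i++) {
            int goldRank = gold.indexOf(ranked.get(i)) + 1;      // 0 if absent from the gold set
            double grel = goldRank == 0 ? 0.0 : 1.0 - (double) goldRank / gold.size();
            dcg += grel / (Math.log(i + 2) / Math.log(2));       // log2(k + 1) with k starting at 1
        }
        return dcg;
    }

    // NDCG(q) = DCG(q) / IDCG(q), where IDCG comes from the gold-set items in their ideal order.
    static double ndcg(List<String> ranked, List<String> gold, int k) {
        double idcg = dcg(gold, gold, k);
        return idcg == 0.0 ? 0.0 : dcg(ranked, gold, k) / idcg;
    }
}
```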

4.3 Evaluation Scenarios

Our work in this article has two different aspects– (a) relevant API suggestion and (b) automatic query reformulation. We thus employ two different setups for evaluating our approach. In the first case, we investigate API suggestion performance of RACK, calibrate our adopted parameters, and compare with the state-of-the-art approaches on API suggestion (Thung et al. 2013b; Zhang et al. 2017) (RQ4–RQ8). In the second case, we reformulate the initial NL queries from the dataset using our suggested API classes. Then we compare our reformulated queries not only with the baseline queries but also with the queries generated by the state-of-the-art approaches on query reformulation (Nie et al. 2016; Zhang et al. 2017) (RQ9–RQ10). We also investigate the potential of our queries in the context of contemporary web and code search practices (RQ11).

4.4 Statistical Significance Tests

In our comparison studies, we perform two statistical tests before claiming the significance of one set of items over the other. In particular, we employ the Mann-Whitney Wilcoxon (MWW) and Wilcoxon Signed Rank (WSR) tests, and refer to them as MWW and WSR respectively in the remaining sections. MWW is a non-parametric test that (1) does not assume normality of the data and (2) is appropriate for small datasets (Haiduc et al. 2013). We use this test for comparing any two arbitrary lists of items. WSR is another non-parametric test that performs a pair-wise comparison between two lists. In our experiments, WSR was used for significance testing between the performance measures (e.g., Hit@K) of RACK in API/code suggestion and those of an existing approach for the same K positions (i.e., 1≤K≤10) (RQ8, RQ9, RQ10). We report the p-value of each statistical test, and use 0.05 as the significance threshold. In addition to these significance tests, we also perform an effect size test using Cliff's delta to estimate the magnitude of the differences. For this work, we use three effect size levels – small (0.147≤Δ<0.33), medium (0.33≤Δ<0.474) and large (Δ≥0.474) (Romano et al. 2006). We use two R packages – stats and effsize – for conducting these statistical tests.
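For illustration only, a comparable analysis can also be scripted in Java with Apache Commons Math instead of the R packages named above; Cliff's delta is computed manually since the library does not provide it, and the Hit@K arrays are hypothetical values.

```java
import org.apache.commons.math3.stat.inference.MannWhitneyUTest;
import org.apache.commons.math3.stat.inference.WilcoxonSignedRankTest;

class SignificanceTests {
    // Cliff's delta: (#{x_i > y_j} - #{x_i < y_j}) / (n * m), interpreted with the
    // thresholds quoted above (0.147 small, 0.33 medium, 0.474 large).
    static double cliffsDelta(double[] x, double[] y) {
        int greater = 0, less = 0;
        for (double xi : x)
            for (double yj : y) {
                if (xi > yj) greater++;
                else if (xi < yj) less++;
            }
        return (double) (greater - less) / (x.length * y.length);
    }

    public static void main(String[] args) {
        // Hypothetical Hit@K values (K = 1..10) for RACK and a competing technique.
        double[] rack  = {32, 51, 61, 66, 70, 74, 77, 79, 81, 83};
        double[] other = {20, 31, 38, 42, 45, 47, 48, 49, 50, 50};

        double mww = new MannWhitneyUTest().mannWhitneyUTest(rack, other);                     // unpaired
        double wsr = new WilcoxonSignedRankTest().wilcoxonSignedRankTest(rack, other, false);  // paired, asymptotic
        System.out.printf("MWW p=%.4f, WSR p=%.4f, Cliff's delta=%.2f%n",
                mww, wsr, cliffsDelta(rack, other));
    }
}
```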

4.5 Matching of Suggested APIs with Goldset APIs

In order to determine the performance of a technique, we apply strict matching between the gold set APIs and the recommended APIs. That is, we consider two API classes matched if (1) they are exactly the same class, or (2) one is a superclass or subclass of the other. For example, if OutputStream is a gold set API and FileOutputStream is a recommended API, we consider them matched (and likewise for the inverse case). If a base class is relevant for a programming task, the derived class is also likely to be relevant, and thus the recommendation is considered accurate. In the case of relevant code segment retrieval, we apply exact matching between the gold set segment and the segment returned by a query. Since the tutorial sites clearly indicate which of the code segments implements which of our selected tasks (i.e., queries), such matching is warranted for this case. It should be noted that items (e.g., API classes, code segments) outside the goldset could also be relevant to our queries. However, we stick to our gold sets for the sake of simplicity and clarity of our experiments. Our gold sets are also publicly available for third-party replication or reuse.
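The class-hierarchy part of this matching can be illustrated with Java reflection; the check below treats identical classes and ancestor–descendant pairs as matches (a sketch, not the exact implementation used in the study).

```java
class ApiMatcher {
    // Two API classes match if they are identical or one is an ancestor of the other,
    // e.g., OutputStream vs. FileOutputStream.
    static boolean matches(Class<?> goldApi, Class<?> suggestedApi) {
        return goldApi.isAssignableFrom(suggestedApi) || suggestedApi.isAssignableFrom(goldApi);
    }

    public static void main(String[] args) {
        System.out.println(matches(java.io.OutputStream.class, java.io.FileOutputStream.class)); // true
        System.out.println(matches(java.util.List.class, java.lang.String.class));               // false
    }
}
```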

4.6 Answering RQ 4: How Does the Proposed Technique Perform in Suggesting Relevant APIs for a Code Search Query?

Each of our selected queries summarizes a programming task that requires the use of one or more API classes from various Java libraries. Our technique recommends the Top-K (e.g., K = 10) relevant API classes for each query. We compare the recommended items with the API-goldset and evaluate them using the four metrics above. In this section, we answer RQ4 using Table 7 and Fig. 13.

Table 7 Performance of RACK
Fig. 13 Hit@K, Mean Average Precision@K, and Mean Recall@K of RACK using (a) the non-weighted version (dashed lines) and (b) the weighted version (solid lines)

Table 7 shows the performance details of our technique for Top-1, Top-3, Top-5 and Top-10 API recommendations. We see that our technique recommends at least one API correctly for 83%+ of the queries with both its (a) non-weighted and (b) weighted versions. The weighted version applies a fine-tuned weight to each of our three heuristics–KAC, KPAC and KKC–whereas the non-weighted version treats the heuristics equally. Such accuracy is highly promising according to the relevant literature (Chan et al. 2012; McMillan et al. 2011). Mean average precision and mean recall of RACK are 40%–46% and 53%–54% respectively for Top-10 results, which are also promising. It should also be noted that RACK provides 55%–61% accuracy and 40%–48% precision for only Top-3 results. That is, roughly one out of every two suggested API classes is relevant to the task, which could be really helpful for effective code search. Our mean reciprocal ranks are 0.46 and 0.52 for the non-weighted and weighted versions respectively. That is, the first correct suggestion is generally found between the first and second positions of our ranked list, which demonstrates the potential of our technique. Figure 13 shows how the different performance metrics–accuracy, precision and recall–change over different values of Top-K. We see that our technique reaches a high precision (i.e., 48.14%) quite early (i.e., K = 3) and the highest (i.e., 48.39%) at K = 5, and then remains comparable for the rest of the K values. The improvement in recall, however, is comparatively slow. It is ≈ 10% for K = 1, and then increases somewhat linearly up to 54% for the last value of K = 10. In contrast, the accuracy of RACK improves in a log-linear fashion, and becomes nearly stationary at 83% for K = 10. While our accuracy and recall could further improve for increased K values, the precision is likely to drop. Thus, we conduct our experiments using only the Top-10 suggestions from a technique. Developers generally do not check items beyond the Top-10 of a ranked list, and the relevant literature (Thongtanunam et al. 2015; Rahman and Roy 2017) also widely applies such a cut-off value. Thus, our choice of K = 1 to 10 is justified.

We also analyse the distribution of API classes from 19 (11 core + 8 non-core) Java packages (i.e., Table 1) in our ground truth, and investigate how they correlate with the corresponding distributions from Stack Overflow. We find that, on average, 10% of the standard Java API classes from each package overlap with our ground truth classes. In contrast, 65% of the API classes from each package are discussed in Stack Overflow Q & A threads according to RQ2. Thus, Stack Overflow discusses more API classes than the ground truth warrants. In short, Stack Overflow is highly likely to cover the relevant classes from the standard API packages, and our approach harnesses that coverage. We also find that 51% of the ground truth classes come from the core packages whereas 10% of them come from the non-core packages. Since Stack Overflow has a good coverage (e.g., ≈ 65%) of both core and non-core packages (Fig. 7), RACK is also likely to perform well for queries that require API classes from non-core packages only.

We also determine the correlation between the four performance measures (e.g., Hit@10, reciprocal rank, average precision, recall) of our API suggestions (against NL queries) and the coverage of their corresponding ground truth in the constructed API database (Section 3.1). We employ two correlation methods – Pearson and Spearman – and find either very weak or negligible correlations (i.e., 0.04 ≤ ρ ≤ 0.12) between these two entities. That is, the API suggestion performance of RACK is not biased by the coverage of the ground truth API classes in our API database. Such a finding strengthens the external validity of our results.


4.7 Answering RQ 5: How Effective Are the Proposed Heuristics–KAC, KPAC and KKC–in Capturing the Relevant API Classes for a Query?

We investigate the effectiveness of our adopted heuristics–KAC, KPAC and KKC–and justify their combination in the API ranking algorithm (i.e., Algorithm 1). Table 8 and Fig. 14 demonstrate how each heuristic performs in capturing the relevant APIs for a given set of code search queries as follows:

Table 8 Role of Proposed Heuristics– KAC, KPAC and KKC
Fig. 14 (a) Hit@K, (b) Mean Average Precision@K (MAP@K), and (c) Mean Recall@K (MR@K) of RACK for the three heuristics–KAC, KPAC and KKC

From Table 8, we see that our technique suggests correct API classes for 78% and 79% of the queries when the KAC and KPAC heuristics are employed respectively. Both heuristics leverage co-occurrences between query keywords (in the question titles) and API classes (in the accepted answers) from Stack Overflow for such recommendation. In contrast, KKC considers coherence among the candidate API classes, and is found to be less effective than the former two heuristics. In fact, KPAC performs the best among all three heuristics with up to 46% precision and 52% recall. However, the weighted combination of our heuristics provides the maximum performance in terms of all four metrics. It provides 83% Hit@10 with a mean reciprocal rank@10 of 0.52, a mean average precision@10 of 46% and a mean recall@10 of 54%. That is, our combination harnesses the strengths of all the heuristics and simultaneously offsets their weaknesses using appropriate weights. All these statistics are also highly promising according to the relevant literature (Thung et al. 2013b; McMillan et al. 2011). Thus, our combination of these three heuristics is justified. Our earlier work combines KAC and KKC, and provides 79% Hit@10 with 35% precision and 45% recall from experiments with 150 queries. Replication with our current extended dataset (i.e., 175 queries) reports similar performance (e.g., 78% Hit@10), which supports our earlier findings (Rahman et al. 2016) as well. In this work, we introduce the new heuristic–KPAC–which improves our performance in terms of all four metrics–Hit@10 (i.e., 7% improvement), reciprocal rank (i.e., 53% improvement), precision (i.e., 48% improvement) and recall (i.e., 17% improvement). Thus, the addition of the KPAC heuristic to our ranking algorithm is justified. Furthermore, we apply appropriate weights to these heuristics for controlling their influence in the API relevance ranking. Figure 14 further demonstrates how the performance of our heuristics changes over various Top-K results. We see that KPAC is the most dominant among the heuristics (as observed above) and achieves the maximum performance. However, the addition of the other two heuristics also improves our performance marginally (i.e., 2%–4%) in terms of all four metrics.


4.8 Answering RQ 6: Does an Appropriate Subset of the Query Keywords Perform Better than the Whole Query in Retrieving the Relevant API Classes?

Since the proposed technique identifies relevant API classes based on their co-occurrences with the keywords from a query, the keywords should be chosen carefully. A selection of random keywords might not return appropriate API classes. Several earlier studies choose nouns and verbs from a sentence, and report their salience in automated comment generation (Wong et al. 2013) and corpus indexing (Capobianco et al. 2013). We thus also extract noun and verb terms from each query as the search keywords using the Stanford POS tagger (Toutanova and Manning 2000), and then use them for our experiments. In particular, we investigate whether our selection of keywords for code search is effective or not.
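The noun/verb extraction step can be sketched with the Stanford tagger as follows; the model path is the stock english-left3words model shipped with the tagger, and the sample query and printed output are illustrative assumptions.

```java
import edu.stanford.nlp.tagger.maxent.MaxentTagger;

import java.util.ArrayList;
import java.util.List;

class KeywordExtractor {
    public static void main(String[] args) throws Exception {
        MaxentTagger tagger = new MaxentTagger(
                "edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger");

        String query = "How do I send an HTML email in Java?";
        List<String> keywords = new ArrayList<>();
        for (String tagged : tagger.tagString(query).split("\\s+")) {
            String[] parts = tagged.split("_");           // token_TAG (Penn Treebank tags)
            if (parts.length == 2 && (parts[1].startsWith("NN") || parts[1].startsWith("VB")))
                keywords.add(parts[0].toLowerCase());     // keep nouns (NN*) and verbs (VB*)
        }
        System.out.println(keywords);                     // e.g., [do, send, html, email, java]
    }
}
```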

From Table 9, we see that our technique performs better with noun-based keywords than with verb-based keywords. The verb-based keywords provide a maximum of 35% Hit@10. In contrast, RACK returns correct API classes for 82% of the queries with 43% precision, 55% recall and a reciprocal rank of 0.49 when only noun-based keywords are chosen for search. However, none of the performance metrics except recall reaches the baseline performance, i.e., the performance of RACK with all query terms minus the stop words. Interestingly, when both nouns and verbs are employed as search keywords, the performance reaches its maximum, especially in terms of accuracy, precision and reciprocal rank. For example, RACK achieves 83% Hit@10 with 46% precision, 54% recall and a reciprocal rank of 0.52. Although the improvement over the baseline performance (i.e., with all keywords of a query) is marginal, such performance is achieved with fewer search keywords. That is, our subset of keywords not only avoids noise but also ensures comparatively higher performance than the baseline at a relatively lower cost (i.e., fewer keywords). Thus, selecting a subset of keywords from the NL query intended for code search is justified, and our chosen subset is found to be effective.

Table 9 Impact of different query term selection

We also investigate the impact of generic search keywords such as “java” in our query. According to our analysis, 11.43% of our queries in the dataset contain this keyword. From Table 9, we see that removal of this keyword marginally degrades most of the performance measures of our technique. Only marginal improvements can be observed in the recall measure for Top-5 and Top-10 results. Thus, our choice of retaining the generic keywords is also justified.


4.9 Answering RQ 7: How Do the Heuristic Weights (i.e., α, β) and Threshold Settings (i.e., γ, δ) Influence the Performance of our Technique?

Our relevance ranking algorithm applies two relative weights–α and β–to our proposed heuristics, and the heuristics are also constrained with two thresholds–γ and δ. While the thresholds help the heuristics collect appropriate candidate API classes, the weights control the influence of the heuristics in the API relevance ranking. In this section, we justify our chosen weights and thresholds, and investigate how they affect the performance of our technique.

We adopt a greedy search-based technique (Yuan et al. 2014) (i.e., a controlled iterative approach) for determining the relative weights of our heuristics. That is, we start our searches with our best initial guesses for α (i.e., 0.25) and β (i.e., 0.30), refine the weight estimates in every iteration with a step size of 0.025, and then stop when the fitness function (Yuan et al. 2014) (i.e., performance) reaches its maximum. We use mean average precision@10 and mean recall@10 as the fitness functions in the search for α and β. Figure 15 shows how different values of α and β influence the performance of RACK. Please note that when one weight is calibrated, the other one is kept constant during performance computation. We see that the precision and recall of RACK reach their maximum when α ∈ [0.300, 0.325] and β = 0.575. The target weights are marked with dashed vertical lines in Fig. 15. While α and β are considered the relative importance of the co-occurrence based heuristics, KAC and KPAC respectively, the remaining weight (1 − α − β) goes to the other heuristic–KKC. Since KKC is found to be relatively weak according to our earlier investigation, we place more emphasis on α and β, and choose the following heuristic weights: 0.325, 0.575 and 0.10 for KAC, KPAC and KKC respectively. Thus, all the weights sum to 1, and such a weighting mechanism was also used by an earlier study (Moreno et al. 2014). The performance of RACK is significantly higher than that of its non-weighted version, especially in terms of MRR@K (i.e., WSR, p-value = 0.002, Δ = 1.00 (large)) and MAP@K (i.e., WSR, p-value < 0.001, Δ = 0.84 (large)) for Top-1 to Top-10 results. Thus, the application of relative weights to our adopted heuristics is also justified.
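The calibration loop can be sketched as below, where fitness(α, β) is assumed to return MAP@10 (or MR@10) of RACK under the given weights; the one-weight-at-a-time greedy search mirrors the description above but is a simplification, not the authors' exact script.

```java
import java.util.function.BiFunction;
import java.util.function.DoubleUnaryOperator;

class WeightCalibration {

    // Greedy, one-weight-at-a-time search: start from the initial guesses,
    // move in steps of 0.025 while the fitness keeps improving.
    static double[] calibrate(BiFunction<Double, Double, Double> fitness) {
        final double step = 0.025;
        final double alpha = climb(a -> fitness.apply(a, 0.30), 0.25, step);  // beta fixed at its initial guess
        final double beta  = climb(b -> fitness.apply(alpha, b), 0.30, step); // alpha fixed at its calibrated value
        return new double[]{alpha, beta, 1.0 - alpha - beta};                 // the remainder goes to KKC
    }

    static double climb(DoubleUnaryOperator fitness, double start, double step) {
        double best = start, bestFitness = fitness.applyAsDouble(start);
        for (double v = start + step; v <= 1.0; v += step) {
            double f = fitness.applyAsDouble(v);
            if (f <= bestFitness) break;      // stop once the fitness no longer improves
            best = v;
            bestFitness = f;
        }
        return best;
    }
}
```

Note that the greedy stopping rule makes the outcome dependent on the initial guesses, which is why the search starts from the best initial estimates mentioned above.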

Fig. 15 (a) Mean Average Precision@10 (MAP@10) and (b) Mean Recall@10 (MR@10) of RACK for different values of the heuristic weights–α and β

Both KAC and KPAC apply the δ threshold for collecting candidate API classes from the token-API linking database. Figure 16 shows how different values of δ affect our performance. We use Hit@K, MAP@K and MR@K as the fitness functions, and determine our fitness for the Top-5 and Top-10 returned results. We see that each of these performance measures reaches its maximum when δ = 10 for both settings. That is, collecting 10 candidate API classes for each keyword or keyword pair from the query is the most appropriate choice; fewer or more than that provides comparable but not the best performance. Thus, we choose δ = 10 in our algorithm, and our choice is justified.

Fig. 16 Performance of RACK for different δ thresholds with (a) Top-5 and (b) Top-10 results considered

KKC applies another threshold, γ, for candidate API selection, which refers to the degree of contextual similarity between any two keywords from the query. Figure 17 reports our investigation of this threshold. We see that different values of γ, ranging from 0 to 0.5, do not change our fitness (i.e., performance) at all. Since the heuristic itself, KKC, is not strong, varying γ also does not have much influence on the performance of our technique. Thus, our choice of γ = 0 is justified. That is, we consider two API classes coherent with each other when their contexts share at least one search keyword.

Fig. 17 Performance of RACK for different γ thresholds with (a) Top-5 and (b) Top-10 results considered

4.10 Answering RQ 8: Can RACK Outperform the State-of-the-Art Techniques in Suggesting Relevant API Classes for a Given Set of Queries?

Thung et al. (2013b) accept a feature request as an input and return a list of relevant API methods. Their API suggestions are based not only on the mining of feature request history but also on the textual similarity between the request texts and the corresponding API documentations. Zhang et al. (2017) determine semantic distance between an NL query and a candidate API using a neural network model (CBOW) and a large code repository, and then suggest a list of relevant API classes for the query. To the best of our knowledge, these are the latest and the closest studies to ours in the context of API suggestion, and thus, we select them for comparison.

Since feature request history is not available in our experimental settings, we implement the Description-Based Recommender module from Thung et al. (2013b). We collect API documentations of 3,300 classes from the Java standard libraries (i.e., JDK 6), and develop a Vector Space Model (VSM) for each of the API classes. In fact, we develop two models for each API class using (1) class header comments only, and (2) class header comments + method header comments, and implement two variants–(Thung et al. 2013b)-I and (Thung et al. 2013b)-II–for our experiments. We use Apache Lucene for VSM development, corpus indexing, and textual similarity matching between the API documentations and each of the queries from our dataset. In the case of Zhang et al. (2017), we (1) make use of IJaDataset (Keivanloo and Rilling 2011) as a training corpus (as was done by the original authors), and (2) learn the word embeddings for both keywords and API classes using fastText (Bojanowski et al. 2016), an improved implementation of word2vec. We then use these vectors to determine the semantic distance between a query and the candidate API classes using cosine similarity (Rahman et al. 2014). We also determine API popularity within the training corpus, and then combine it with the semantic distance metric to identify a set of relevant API classes for the NL query.

Table 10 summarizes the comparative analysis between our technique–RACK–and three existing techniques. Here, boldfaced items refer to the maximum measures provided by the existing techniques and our technique. We see that the variants of Thung et al. (2013b) provide a maximum of about 50% accuracy with about 26% precision and 29% recall for Top-10 results. On the other hand, RACK achieves a maximum accuracy of 83% with 46% precision and 54% recall, which are 66%, 79% and 87% higher respectively. We investigate how the four performance measures change for different Top-K results for each of these techniques. From Fig. 18, we see that the Hit@K of RACK increases gradually up to 83% whereas such measures for the textual similarity based techniques stop at 50%. The MRR@K of RACK improves from 0.38 to 0.52 whereas such measures for the counterparts are as low as 0.20–0.29. It should be noted that RACK reaches its maximum precision, i.e., 48%, quite early at K = 3, and then its recall gradually improves up to 54% (at K = 10). In contrast, such measures for the counterparts are at best 25% and 30% respectively. These results demonstrate the superiority of our technique. From the box plots in Fig. 19, we see that RACK performs significantly better than both variants in terms of all three metrics–accuracy, precision and recall. Our median accuracy is above 70% whereas such measures for those variants are close to 40%. The same goes for the precision and recall measures. We perform significance and effect size tests, and compare our performance measures with those of the state-of-the-art for various Top-K results (1≤K≤10). We find that the performance of our approach is significantly higher than that of the existing techniques in terms of Hit@K (i.e., WSR, p-value = 0.002 < 0.05, Δ = 0.79 (large)), MRR@K (i.e., WSR, p-value = 0.002 < 0.05, Δ = 0.90 (large)), MAP@K (i.e., WSR, p-value = 0.002 < 0.05, Δ = 0.90 (large)) and MR@K (i.e., WSR, p-value = 0.002 < 0.05, Δ = 0.70 (large)). All these findings suggest that (1) textual similarity between a query and API signatures or documentations might not always be effective for API recommendation, and (2) the semantic distance between keywords and API classes should be calculated using an appropriate training corpus. Our technique overcomes these issues by applying three heuristics–KAC, KPAC and KKC–which leverage the API usage knowledge of a large developer crowd stored in Stack Overflow. The performance reported for Thung et al. (2013b) is project-specific, and the technique is restricted to feature requests (Thung et al. 2013b). In contrast, our technique is generic and adaptable to any type of code search, and it is also independent of the subject systems. Although Zhang et al. (2017) employ a large training corpus, they learn word embeddings for NL keywords from source code, which might not always be helpful since source code inherently has a smaller vocabulary than regular texts (Hellendoorn and Devanbu 2017). In contrast, we leverage the contexts of NL keywords and API classes more carefully from the Stack Overflow Q & A site to determine their relevance. Furthermore, we harness the expertise of a large crowd of technical users for relevant API suggestion, which was not considered by the past studies in the literature. Thus, our technique possibly has a greater potential.

Table 10 Comparison of API recommendation performance with existing techniques (for various Top-K results)
Fig. 18 Comparison of API recommendation performance with the existing techniques–(a) Hit@K, (b) Mean Reciprocal Rank@K, (c) Mean Average Precision@K, and (d) Mean Recall@K

Fig. 19 Comparison of API recommendation performance with existing techniques using box plots

4.11 Answering RQ 9: Can RACK Significantly Improve the Natural Language Queries in Terms of Relevant Code Retrieval Performance?

Our earlier research questions (RQ4–RQ8) evaluate the performance of RACK in suggesting relevant API classes for a natural language query intended for code search. Although they clearly demonstrate the potential of our technique, another way to evaluate it is the code retrieval performance of our reformulated queries. In this section, we investigate whether our reformulations improve the baseline queries in terms of their relevant code retrieval performance. We employ three corpora – 4K-Corpus, 256K-Corpus, and 769K-Corpus – each of which includes the 175 ground truth code segments (see Section 4.1 for details). We apply limited natural language preprocessing (i.e., removal of stop words and keywords, splitting of complex tokens) to each corpus document, and then index the documents for retrieval. We employ Apache Lucene, a popular search engine that has been used by several earlier studies from the literature (Haiduc et al. 2013; Ponzanelli et al. 2014; Moreno et al. 2014), for document indexing and source code retrieval.
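The retrieval step can be sketched as follows, reusing the index layout from the earlier indexing sketch; the reformulated query shown is a hypothetical example rather than one from the dataset.

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

import java.nio.file.Paths;

class CodeRetrieval {
    public static void main(String[] args) throws Exception {
        try (DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("corpus-index")))) {
            IndexSearcher searcher = new IndexSearcher(reader);
            QueryParser parser = new QueryParser("contents", new StandardAnalyzer());

            // RACK_A+Q style query: baseline NL keywords plus suggested API classes (hypothetical).
            String reformulated = "send html email MimeMessage Transport Session";
            TopDocs top10 = searcher.search(parser.parse(QueryParser.escape(reformulated)), 10);

            for (ScoreDoc hit : top10.scoreDocs) {
                Document doc = searcher.doc(hit.doc);       // retrieve the stored document id
                System.out.println(doc.get("id") + "\t" + hit.score);
            }
        }
    }
}
```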

Table 11 and Fig. 20 summarize our findings on comparing our reformulated queries with the baseline queries. We consider two versions of our reformulated queries–RACK_A and RACK_A+Q–for our experiments. While RACK_A comprises the suggested API classes only, RACK_A+Q combines both the suggested API classes and the NL keywords from the baseline queries. From Table 11, we see that the baseline queries (i.e., comprising NL keywords) perform poorly, especially with the larger corpora. In the case of the 256K-Corpus, they return relevant code segments at the Top-1 position and within the Top-5 positions for only 22% and 38% of the queries respectively (i.e., Hit@K). In contrast, our reformulated queries, RACK_A+Q, return relevant code segments for 40% and 59% of the queries within the Top-1 and Top-5 positions respectively, which is more promising. We see a notable increase in query performance with the smaller corpus (i.e., 4K-Corpus) and a notable decrease with the bigger corpus (i.e., 769K-Corpus). Such observations can be explained by the reduced and added noise in the corpus respectively. However, our reformulated queries perform consistently better than the baseline across all three corpora. For example, while the baseline Hit@10 reduces to 34% for the 769K-Corpus, our reformulated queries deliver a Hit@10 of 57%, which is 65% higher. Thus, our query reformulations offer a 23%–80% improvement in Hit@K over the baseline performance across the three corpora. It should be noted that Hit@1 and Hit@5 could reach up to 60% and 85% respectively when the goldset API classes are used as the search queries, and the combination of NL queries and goldset API classes performs even better. Such findings also strengthen our idea of suggesting and using relevant API classes for code search. We also see that the reformulated queries containing both NL keywords and API classes (e.g., RACK_A+Q) are always better than those containing only the suggested API classes (e.g., RACK_A).

Table 11 Comparison of source code retrieval performance with baseline queries
Fig. 20 Comparison of code retrieval performance with the baseline queries in terms of (a) Hit@K and (b) MRR@K

Our MRR@K measures in Table 11 are also promising. They suggest that, on average, the relevant code segments are returned by our queries within the top three positions of the result list across all three corpora, which is encouraging from the perspective of practical use. Furthermore, our MRR@K measures are 29%–81% higher than the baseline counterparts across all three corpora, which demonstrates the potential of our reformulated queries for code search.

Figure 20 further demonstrates the performance of the baseline queries and our reformulated queries for various Top-K results. We see that the Hit@K and MRR@K of our queries are higher than those of the baseline queries by a large margin across all three corpora–4K-Corpus, 256K-Corpus, and 769K-Corpus. Non-parametric significance tests (i.e., Wilcoxon Signed Rank, Mann-Whitney Wilcoxon) and Cliff's delta effect size analysis also confirm the statistical significance of our performance improvements for both Hit@K (i.e., all p-values < 0.05, 0.82≤Δ≤0.94 (large)) and MRR@K (i.e., all p-values < 0.05, Δ = 1.00 (large)). For the sake of simplicity, only one code segment (i.e., collected from the tutorial sites, Section 4.1) was chosen as the ground truth of each query. Thus, Hit@K and MRR@K are the most appropriate performance metrics for this case, and consequently, precision and recall were not chosen for this evaluation.

We also investigate query performance by relaxing the Top-K constraint and analysing all the results returned by each query. Table 12 and Fig. 21 report our findings on query effectiveness (Moreno et al. 2014, 2015). That is, if a reformulated query returns the first relevant code segment closer to the top of the result list than the baseline query does, we consider it query quality improvement, and the opposite case query quality worsening. If there is no change in the result ranks between the baseline and reformulated queries, we call it query quality preserving. From Table 12, we see that 46%–64% of the baseline queries can be improved by our technique, RACK_A+Q, across all three corpora. It worsens only 11%–16% of the queries, and thus offers a net gain of 35%–49% query improvement. While a 60% net gain is possible in the best case scenario using the gold set APIs directly, our technique delivers ≈ 50%, which is promising according to the relevant literature (Rahman and Roy 2017; Haiduc et al. 2013). Figure 21 further contrasts the baseline and our reformulated queries. We see that the result ranks provided by RACK are closer to zero (i.e., the top of the list) across all three corpora. Such a finding provides further evidence of the high potential of our suggested queries.

Table 12 Improvement of baseline queries by RACK
Fig. 21 Comparison of QE distributions with the baseline queries across (a) 4K-Corpus, (b) 256K-Corpus and (c) 769K-Corpus

4.12 Answering RQ 10: Can RACK Outperform the State-of-the-Art Techniques in Improving the Natural Language Queries Intended for Code Search?

Although our reformulations improve the baseline queries significantly, we further validate them against the queries generated by existing techniques including the state-of-the-art. The study of Zhang et al. (2017) is closely related to ours. They suggest relevant API classes for natural language queries intended for code search by analysing the semantic distance between query keywords and API classes. Thung et al. (2013b) is another related study in the context of relevant API suggestion, which was originally targeted at feature location (i.e., project-specific code search). More recently, Nie et al. (2016) reformulate a query for code search by collecting pseudo-relevance feedback from Stack Overflow and then applying Rocchio's expansion (Rocchio 1971) to the query. Their tool QECK suggests software-specific terms from programming questions and answers as query expansions. To the best of our knowledge, these are the most recent and most closely related works to ours in the context of query reformulation for code search, which makes them the state-of-the-art. We thus compare our technique with these three existing techniques (Zhang et al. 2017; Nie et al. 2016; Thung et al. 2013b) in terms of Hit@K, MRR@K and Query Effectiveness (QE).

From Table 13, we see that the retrieval performance of RACK is consistently higher than that of the state-of-the-art techniques or their variants across all three corpora. Nie et al. (2016) perform the best among the existing techniques. Their approach achieves 41%–75% Hit@5 with an MRR@5 between 0.31 and 0.59 on our dataset. However, our technique, RACK, achieves 49%–78% Hit@5 with 0.35–0.62 MRR@5, which are 4%–19% and 5%–13% higher respectively. RACK also achieves a Hit@10 of 57% with the extra-large corpus (i.e., 769K-Corpus), which is 18% higher than the state-of-the-art measure, i.e., 48% Hit@10 by Nie et al. (2016). While the performance measures of each technique degrade as the corpus size grows from 4K to 769K documents, our measures remain consistently higher than the state-of-the-art. Thus, RACK is more robust to varying corpus sizes than any of the existing techniques under our study.

Table 13 Comparison of code retrieval performance with existing techniques

Figure 22 further demonstrates how RACK outperforms the state-of-the-art techniques for various Top-K results in terms of Hit@K and MRR@K. We compare RACK with QECK by Nie et al. (2016), which is clearly the state-of-the-art according to the above analysis, for Top-1 to Top-10 performance measures using non-parametric tests. Our Mann-Whitney Wilcoxon and Cliff's delta tests report the statistical significance of RACK over Nie et al. (2016) with large effect sizes for both Hit@K (i.e., p-values < 0.05, 0.33≤Δ≤0.52 (large)) and MRR@K (i.e., p-values < 0.05, 0.68≤Δ≤0.90 (large)) across all three corpora. Thus, the findings above clearly demonstrate the superiority of our technique over the existing studies on query reformulation from the literature.

Fig. 22 Comparison of code retrieval performance with existing techniques using (a, b) 4K-Corpus, (c, d) 256K-Corpus and (e, f) 769K-Corpus

We also compare our technique with the existing techniques in terms of Query Effectiveness (QE). From Table 14, we see that Nie et al. (2016) perform the best with the 4K-Corpus whereas (Thung et al. 2013b)-II performs the best with the remaining two corpora–256K-Corpus and 769K-Corpus. Nie et al. (2016) improve 32% of the baseline queries whereas (Thung et al. 2013b)-II improves 43%–49% of the queries. In contrast, RACK improves 46% and 62%–64% of the baseline queries in the same contexts. In particular, our technique offers a 48% net gain as opposed to the 26% provided by (Thung et al. 2013b)-II, which is 87% higher. Thus, RACK clearly has a higher potential for query reformulation than the state-of-the-art. It should also be noted that RACK degrades only 11%–16% of the queries across all three corpora, which suggests the reliability and robustness of the technique. Figure 23 further contrasts the result ranks of RACK with those of the state-of-the-art approaches using box plots. We see that, on average, RACK provides better ranks, returning results closer to the top of the list than the competing approaches. For example, (Thung et al. 2013b)-II returns 50% of its first correct results within the Top-8 positions and 75% of them within the Top-96 positions when dealing with the extra-large corpus (i.e., 769K-Corpus). In contrast, RACK returns such results within the Top-5 and Top-42 positions, which are 38% and 57% better respectively. Similar findings can be observed with the remaining two corpora. All these findings clearly demonstrate the superiority of our technique in query reformulation over the state-of-the-art.

Table 14 Comparison of query improvements with existing techniques
Fig. 23 Comparison of QE distributions with the state-of-the-art using (a) 4K-Corpus, (b) 256K-Corpus, and (c) 769K-Corpus

4.13 Answering RQ 11: How Does RACK Perform Compared to the Popular Web Search Engines and Code Search Engines?

Existing studies (Rahman et al. 2018; Sadowski et al. 2015; Nakasai et al. 2016; Xia et al. 2017) report that software developers frequently use general-purpose web search engines (e.g., Google) for code search. Hence, these search engines are natural candidates for comparison with our technique. We thus compare our approach with three popular web and code search engines–Google, Stack Overflow native search and GitHub code search. Unfortunately, we faced several challenges during our comparison with these commercial search engines. First, the results from these search engines frequently change due to their dynamic indexing, which makes it hard to develop a reliable or stable oracle from their results. In fact, we found that the Top-30 Google results collected for the same query on two different dates (i.e., two weeks apart) matched only 55%. Second, the Google search API was used for our experiments given that GUI-based Google search is not practical for 175 x 2 = 350 queries. However, this paid search API imposes certain restrictions on the number of API calls that can be made. That is, the results for the 175 baseline queries and their reformulated queries could not be collected all at the same time. Given the changing nature of the underlying corpus, a comparison between the results of the baseline and reformulated queries could thus not be fair. Third, these commercial search engines are mostly designed for natural language queries, and they impose certain restrictions on the query length and query type. Hence, they might either produce poor results or totally fail to produce any results for our reformulated queries, which mostly contain structured keywords (e.g., multiple API classes). Thus, we found a head-to-head comparison with these commercial search engines infeasible. Despite the above challenges, we still compare with them, and investigate whether our reformulated queries can significantly improve their search results through a post-processing (i.e., re-ranking) step.

Collection of Search Results and Construction of Oracle

We collect the Top-30 results for each query from each search engine for oracle construction. We make use of the Custom Search API by Google and the native API endpoints of Stack Overflow and GitHub to collect the search results. Given the large volume of search results (i.e., 175 x 30 = 5,250), it is impractical to manually analyse them all. Hence, we use a semi-automated approach for constructing the oracle for these web/code search engines. In particular, we extract the code segments from each of the result pages using appropriate tools (e.g., Jsoup). In the case of GitHub search results, we use JavaParser to extract the method bodies as code segments. Then we determine their similarity against the original ground truth code that was extracted from the tutorial sites in Section 4.1. For this, we use four code similarity algorithms–Cosine similarity (Rahman et al. 2014), Dice similarity (Haiduc et al. 2013), Jaccard similarity (Svajlenko and Roy 2018) and Longest Common Subsequence (LCS) (Roy and Cordy 2008). These algorithms are frequently used as baselines for various code clone detection techniques (Roy and Cordy 2008; Svajlenko and Roy 2018). We collect the four normalized code similarity scores for each result, average them, and then extract the Top-10 results containing the most relevant code segments. We also manually analyse a few of these results (and their code segments), and attempt to tweak the selection with various similarity score thresholds. Unfortunately, score thresholds alone were not sufficient to construct the oracle for all the queries. We thus use these Top-10 results as the oracle for our web/code search engines.
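The similarity-based ranking used for oracle construction can be sketched as follows; the token-level variants of Cosine, Dice, Jaccard and LCS below are simplified stand-ins for the cited algorithms, not their exact implementations.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

class CodeSimilarity {

    static List<String> tokens(String code) {
        return Arrays.stream(code.toLowerCase().split("\\W+"))
                .filter(t -> !t.isEmpty()).collect(Collectors.toList());
    }

    static double jaccard(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a); inter.retainAll(b);
        Set<String> union = new HashSet<>(a); union.addAll(b);
        return union.isEmpty() ? 0 : (double) inter.size() / union.size();
    }

    static double dice(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a); inter.retainAll(b);
        return (a.size() + b.size()) == 0 ? 0 : 2.0 * inter.size() / (a.size() + b.size());
    }

    static double cosine(Set<String> a, Set<String> b) {
        Set<String> inter = new HashSet<>(a); inter.retainAll(b);
        return (a.isEmpty() || b.isEmpty()) ? 0 : inter.size() / Math.sqrt((double) a.size() * b.size());
    }

    // Normalized longest common subsequence over the token sequences.
    static double lcs(List<String> a, List<String> b) {
        if (a.isEmpty() || b.isEmpty()) return 0;
        int[][] dp = new int[a.size() + 1][b.size() + 1];
        for (int i = 1; i <= a.size(); i++)
            for (int j = 1; j <= b.size(); j++)
                dp[i][j] = a.get(i - 1).equals(b.get(j - 1))
                        ? dp[i - 1][j - 1] + 1 : Math.max(dp[i - 1][j], dp[i][j - 1]);
        return (double) dp[a.size()][b.size()] / Math.max(a.size(), b.size());
    }

    // Average of the four normalized scores, used to rank results against the ground truth code.
    static double averageScore(String candidate, String groundTruth) {
        List<String> ta = tokens(candidate), tb = tokens(groundTruth);
        Set<String> sa = new HashSet<>(ta), sb = new HashSet<>(tb);
        return (cosine(sa, sb) + dice(sa, sb) + jaccard(sa, sb) + lcs(ta, tb)) / 4.0;
    }
}
```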

Comparison Between Initial Search Results and Re-ranked Results Using the Reformulated Queries

Once a search engine returns results for natural language (NL) queries, we re-rank them with the corresponding reformulated queries provided by RACK. We first detect the presence of code segments in their contents, and then collect Top-10 documents based on their relevance to our reformulated queries (i.e., NL keywords + relevant API classes). We compare both the initial and re-ranked results with the oracle constructed above.

From Table 15, we see that our re-ranking approach improves upon the initial results returned by each of the web and code search engines. The improvements are observed especially in terms of precision, reciprocal rank and NDCG. For example, Google achieves 69% precision with a reciprocal rank of 0.82 and an NDCG of 0.46. However, our approach, RACK_Google, achieves 84% precision with a reciprocal rank of 0.92 and an NDCG of 0.67, which are 22%, 12% and 46% higher respectively. That is, although Google performs well as a general-purpose web search engine, it might not always be precise for code search. A similar observation is shared by a recent survey (Rahman et al. 2018), which reports that developers need more query reformulations during code search with web search engines. GitHub native search achieves 55% precision, a reciprocal rank of 0.58 and an NDCG of 0.47. In contrast, our approach, RACK_GitHub, delivers 69% precision with a reciprocal rank of 0.74 and an NDCG of 0.62, which are 24%, 28% and 26% higher respectively. Such findings demonstrate the potential of our reformulated queries. Figure 24 further contrasts our approach with the contemporary web/code search engines for Top-1 to Top-10 results. While Google is the best performer among the three search engines, our re-ranking using RACK outperforms Google by a significant margin in terms of precision (i.e., WSR, p-value < 0.05, Δ = 0.90 (large)), reciprocal rank (i.e., WSR, p-value < 0.05, Δ = 0.90 (large)) and NDCG (i.e., WSR, p-value < 0.05, Δ = 1.00 (large)). Thus, all the findings above suggest the high potential of our reformulated queries for improving code search performed either with web or code search engines. The status quo of Internet-scale code search is far from ideal (Rahman et al. 2018), and our reformulated queries could benefit the traditional practices.

Table 15 Comparison with popular web/code search engines
Fig. 24 Comparison of RACK with popular web/code search engines

5 Threats to Validity

We identify a few threats to the validity of our findings. While we attempt to mitigate most of them using appropriate measures, the remaining ones should be addressed in future work. Our identified threats and their mitigation details are discussed as follows:

5.1 Threats to Internal Validity

They relate to experimental errors and biases (Yuan et al. 2014). We develop a gold set for each query by analysing the code examples and the discussions from tutorial sites which might involve some subjectivity. However, each of the examples is a working solution to the corresponding task (i.e., NL-query), and they are frequently consulted. Thus, the gold set development using sample code from the tutorial sites is probably a more objective evaluation approach than human judgements of API relevance or code relevance that introduce more subjective bias (Chan et al. 2012). According to the exploratory findings (Section 2.4), our technique might be effective only for the recommendation of popular and frequently used API classes. Since fully qualified names are mostly missing in Stack Overflow texts, third-party APIs similar to Java API classes could also have been mistakenly considered despite the fact that questions and answers selected for the study were tagged with <java>.

We use a dataset of 175 queries and a popular search engine–Apache Lucene (Haiduc et al. 2013)–for determining their retrieval performance across three corpora of varying sizes. For the sake of simplicity, only one code segment was considered relevant for each query. However, in practice, there could be multiple code segments in the corpus that are relevant to a given query. In this work, we trade such completeness for transparency and objectivity in our evaluation and validation.

During code or web search, developers generally choose the most appropriate keywords when a list of auto-generated suggestions is provided. We re-enact such behaviour by choosing only the goldset API classes from within the suggested list, and use them for query reformulation. Such a choice might have favoured the code retrieval performance of our queries. However, the same approach was carefully followed for all the existing techniques under study (Thung et al. 2013b; Nie et al. 2016; Zhang et al. 2017). Thus, they received the same treatment in the performance evaluation as ours. Furthermore, the validation results (i.e., RQ10) clearly report the superiority of our suggested queries over their counterparts from the existing techniques. Our investigation using the three contemporary web/code search engines has also drawn a similar conclusion for RACK (i.e., RQ11).

5.2 Threats to External Validity

They relate to the generalizability of a technique. So far, we have experimented with API classes from the standard Java libraries only. However, since our technique mainly exploits co-occurrences between keywords and APIs, it can be easily adapted for API recommendation in other programming domains. Since the popularity of a programming language or the change proneness of an API (Linares-Vásquez et al. 2014) plays a significant role in triggering the Stack Overflow discussions that we mine, RACK could be effective for popular languages (e.g., Java, C#) but comparatively less effective for less popular or less used languages (e.g., Erlang).

5.3 Threats to Construct Validity

Construct validity relates to the suitability of the evaluation metrics. Our work is aligned with both the recommendation system and information retrieval domains. We use Hit@K and Reciprocal Rank, which are widely used for evaluating recommendation systems (Thung et al. 2013b; Thongtanunam et al. 2015). The remaining metrics are well known in information retrieval, and are also frequently used by studies (Thung et al. 2013b; Chan et al. 2012; McMillan et al. 2011) relevant to our work. This suggests little or no threat to construct validity.

5.4 Threats to Statistical Conclusion Validity

Conclusion validity concerns the relationship between treatment and outcome (Linares-Vásquez et al. 2014). We answer 11 research questions in this work, and collect our data from publicly available, popular programming Q & A and tutorial sites. In order to answer these questions, we use non-parametric tests for statistical significance (e.g., Mann-Whitney Wilcoxon, Wilcoxon Signed Rank), effect size analysis (e.g., Cliff's delta) and confidence interval analysis. We apply these tests to our experiments wherever appropriate and report the detailed test results (e.g., p-values, Cliff's Δ). Thus, the threats to statistical conclusion validity are likely mitigated.

6 Related Work

Our work is aligned with three research topics–(1) API/API usage recommendation, (2) query reformulation for code search, and (3) crowdsourced knowledge mining. In this section, we discuss existing studies from the literature of each of these research topics, and compare or contrast our work with them.

6.1 API Recommendation

Existing studies on API recommendation accept one or more natural language queries, and recommend relevant API classes and methods by analysing the code surfing behaviour of developers and API invocation chains (McMillan et al. 2011), API dependency graphs (Chan et al. 2012), feature request history or API documentations (Thung et al. 2013b), and library usage patterns (Thung et al. 2013a). McMillan et al. (2011) first propose Portfolio, which recommends relevant API methods for a code search query by employing natural language processing, indexing and graph-based algorithms (e.g., PageRank, Brin and Page 1998). Chan et al. (2012) improve upon Portfolio, and return a connected sub-graph containing the most relevant APIs by employing further sophisticated graph-mining and textual similarity techniques. Gvero and Kuncak (2015) accept a free-form NL query, and return a list of relevant method signatures by employing natural language processing and statistical language modelling on the source code. A few studies offer NL interfaces for searching relevant program elements in the project source (Kimmig et al. 2011) or relevant artefacts in the project management repository (Lin et al. 2017). Thung et al. (2013b) recommend relevant API methods to assist the implementation of an incoming feature request by analysing the request history and the textual similarity between API details and the request texts. In short, each of these studies analyses the lexical similarity between a query and the signature or documentation of an API to find the candidate APIs. Such approaches might not always be effective and might face vocabulary mismatch issues given that the choice of query keywords could be highly subjective (Furnas et al. 1987). On the other hand, we exploit three co-occurrence heuristics that are derived from crowdsourced knowledge, and they are found to be more effective in the selection of candidate API classes. Co-occurrence heuristics overcome the vocabulary mismatch issues (Haiduc and Marcus 2011; Furnas et al. 1987), and provide a generic, language- and project-independent solution. Besides, we exploit the expertise of a large crowd of technical users stored in Stack Overflow for API recommendation, which none of the earlier studies did. Zhang et al. (2017) determine the semantic distance between NL keywords and API classes using a neural network model (CBOW), and suggest relevant API classes for a generic NL query intended for code search. They collect their API classes from OSS projects whereas ours are collected from Stack Overflow, the largest programming Q & A site on the web. Their work is closely related to ours. We compare with two variants of Thung et al. (2013b) and with Zhang et al. (2017); readers are referred to Sections 4.10 and 4.12 for the detailed comparison. Since Thung et al. (2013b) outperform Chan et al. (2012) as reported in their work, we compare with Thung et al. (2013b) for our validation.

6.2 API Usage Pattern Recommendation

Thummalapenta and Xie (2007) propose ParseWeb, which takes a source object type and a destination object type, and returns a sequence of method invocations that yields the destination object from the source object. Xie and Pei (2006) take a query that describes the method or class of an API, and recommend a frequent sequence of method invocations for the API by analysing hundreds of open source projects. Warr and Robillard (2007) recommend a set of API methods that are relevant to a target method by analysing the structural dependencies between them. Each of these techniques is relevant to our work since they recommend API methods. However, they operate on structured queries rather than natural language queries, and thus comparing ours with theirs is not feasible. Moreover, we introduce three heuristics and exploit crowd knowledge for API recommendation, which were not considered by any of these existing techniques. This makes our contribution significantly different from all of them.

6.3 Query Reformulation for Code Search

There have been a number of studies on query reformulation that target either project-specific code search, e.g., concept/feature location (Gay et al. 2009; Hill et al. 2009; Haiduc and Marcus 2011; Haiduc et al. 2013; Howard et al. 2013; Yang and Tan 2012; Rahman and Roy 2017; Kevic and Fritz 2014b; Rahman and Roy 2016) and bug localization (Sisman and Kak 2013; Chaparro et al. 2017), or general-purpose code search (Gvero and Kuncak 2015; Li et al. 2016; Nie et al. 2016). Gay et al. (2009) first propose a "relevance feedback" based model for query reformulation in the context of concept location. Once the initial query retrieves search results, a developer is expected to mark them as either relevant or irrelevant. Their model then analyses these marked source documents, and expands the initial query using Rocchio expansion (Rocchio 1971). Although developer feedback on document relevance is effective, collecting it is time-consuming and sometimes infeasible. Therefore, later studies came up with a less effective but more feasible alternative–pseudo-relevance feedback–for query reformulation, where only the Top-K search results (retrieved by the initial query) are considered relevant. They then apply term weighting (Rahman and Roy 2017; Rocchio 1971; Kevic and Fritz 2014a), term context analysis (Hill et al. 2009; Sisman and Kak 2013; Howard et al. 2013; Yang and Tan 2012), query quality analysis (Haiduc and Marcus 2011; Haiduc et al. 2013), and machine learning (Haiduc et al. 2013) to reformulate a given query for concept/feature location. Our work falls into the category of general-purpose code search. Relevance feedback models were also adopted in this context for query reformulation. Wang et al. (2014) incorporate developer feedback in the code search, and improve the result ranking. Nie et al. (2016) employ Stack Overflow as the provider of relevance feedback on the initial query, and then reformulate it using Rocchio expansion. Although we do not apply relevance feedback for query reformulation, the work of Nie et al. (2016) is not only closely related to ours but also relatively recent. Another closely related recent work by Zhang et al. (2017) leverages the semantic distance between NL keywords and API classes, and then expands the NL queries for code search using semantically relevant API classes. We thus compare our technique with the three techniques above (Thung et al. 2013b; Zhang et al. 2017; Nie et al. 2016), and the detailed comparison can be found in RQ10. Li et al. (2016) develop a lexical database using software-specific tags from Stack Overflow questions, and reformulate a given query using synonym substitution. However, their approach searches for relevant software projects rather than source code segments. Campbell and Treude (2017) mine titles from Stack Overflow questions, and suggest automatic expansions to the initial query in the form of auto-completion. However, this approach also relies on the textual similarity between the initial query and the expanded query, and thus is subject to vocabulary mismatch issues. In contrast, we overcome such issues using three co-occurrence based heuristics. Besides, their approach is constrained by a fixed set of predefined queries from Stack Overflow questions, and thus might not help much in the formulation of custom queries. RACK does not impose such restrictions on query formulation.

6.4 Crowdsourced Knowledge Mining

Existing studies (Yuan et al. 2014; Rahman and Roy 2016; Nie et al. 2016; Li et al. 2016; Ponzanelli et al. 2014; Sirres et al. 2018) leverage crowd-generated knowledge to support several search-related activities performed by developers. Yuan et al. (2014) first use programming questions and answers from Stack Overflow to identify semantically similar software-specific word pairs. They construct the context of each word by collecting co-occurring words from Stack Overflow questions, answers and tags, and then determine the semantic similarity between a pair of NL words based on the overlap between their corresponding contexts. Such word pairs might help in addressing the vocabulary mismatch issues with web search. However, they might not help much with code search given that source code and regular texts often hold different semantics for the same word (Yang and Tan 2012; Bajracharya and Lopes 2012b). Wong et al. (2013) mine developers' descriptions of code snippets from Stack Overflow answers, and suggest them as comments for similar code segments. Rigby and Robillard (2013) mine posts from Stack Overflow, and extract salient program elements using regular expressions and machine learning. Along the same line as these earlier studies, we mine Stack Overflow questions and answers to reformulate a given natural language query for code search. While our work is related to earlier studies (Nie et al. 2016; Li et al. 2016), it is also significantly different in many ways. First, we suggest relevant API classes for an NL query by considering keyword-API co-occurrences whereas Nie et al. (2016) suggest mostly natural language terms as query expansions by employing pseudo-relevance feedback. Li et al. (2016) reformulate queries using crowd wisdom from Stack Overflow for searching open source projects whereas our queries are targeted at more granular software artefacts, e.g., source code snippets. Furthermore, we suggest relevant API classes in contrast to the synonymous NL tags suggested by Li et al. (2016), which makes our reformulations more appropriate and effective for code search (Bajracharya and Lopes 2012b). Another contemporary work (Sirres et al. 2018) uses all program artifacts from Stack Overflow posts indiscriminately for expanding code search queries, which could be noisy. In contrast, we leverage co-occurrences between NL keywords (in the question title) and API classes (in the accepted answer) as a proxy for their relevance, and choose only the appropriate API classes for our query reformulation.

Our work in this article also significantly extends our earlier work (Rahman et al. 2016) in various aspects. We improve the earlier heuristics by extensively calibrating their weights and thresholds, and introduce a novel heuristic–Keyword Pair API Co-occurrence–that performs better than the earlier ones. We conduct experiments with a relatively larger dataset containing 175 distinct queries, and further evaluate the queries in terms of relevant code retrieval performance, which was missing from the earlier work. We not only compare with several state-of-the-art studies but also demonstrate RACK's potential for application in the context of traditional web/code search practices. Furthermore, we extend our earlier analysis and answer 11 research questions as opposed to the seven questions answered by the earlier work.

7 Conclusion & Future Work

To summarize, we propose a novel query reformulation technique–RACK–that suggests a list of relevant API classes for a natural language query intended for code search. It employs three novel heuristics, and collects the relevant API classes by exploiting the crowdsourced knowledge stored in Stack Overflow questions and answers. Experiments using 175 code search queries from three Java tutorial sites show that RACK recommends relevant APIs with 83% Hit@10, 46% precision and 54% recall, which are highly promising. Reformulated queries based on our recommended APIs significantly improve the baseline queries in terms of code retrieval performance. Comparison with the state-of-the-art techniques shows that our technique outperforms them by a significant margin not only in relevant API suggestion but also in query reformulation for code search. Furthermore, our technique is generic, project-independent, and it exploits invaluable crowd-generated knowledge for relevant API suggestion. Our work in this article has opened up the following future research directions:

  • Determining Relative API Salience: Each programming task requires one or more API classes where some classes (e.g., MimeMessage) are more important than others (e.g., Properties) for the task (e.g., “How do I send an HTML email?”). However, based on our experience from this study, such relative importance is task-sensitive and sometimes even subjective. Given that code search queries are short and provide very little context about the task, determining the relative API salience is even more challenging. While we attempt to address this issue using three novel heuristics derived from crowd-generated knowledge, further work is warranted (1) to better understand the issue, and (2) to deliver a more effective ranking of the suggested API elements.

  • Query Quality Analysis: Given multiple natural language queries for the same programming task, determining the best one without executing them is a challenging task. Identification of the best query could help developers avoid numerous trials and errors or even performance regressions. The information retrieval and concept/feature location communities have long strived to address this challenge using several query quality/difficulty metrics and machine learning (Carmel et al. 2006; Carmel and Yom-Tov 2010; Haiduc et al. 2013; Haiduc and Marcus 2011). Since we leverage keyword-API associations in this work for relevant API suggestion, such associations could possibly be leveraged for query quality estimation as well.