
1 Introduction

With the rapid development of mobile technologies, an increasing number of mobile apps have been developed and published. Similar to traditional software development, the development of mobile apps also starts from requirements elicitation, where the quality of requirements plays a key role in assuring the success of the software [1]. Since the feature-oriented domain analysis (FODA) approach was proposed [2], the feature-oriented approach has been widely used in software development by practitioners. According to the IEEE standard glossary of software engineering terminology [3], a feature is defined as “a software characteristic specified or implied by requirements documentation”. Due to the close relationship between requirements and features, extracting appropriate features contributes to requirements elicitation, which in turn promotes the success of software development.

Various sources can be used to extract feature requests. For example, online open forums have been used by project managers to elicit features [4, 5]. Domain knowledge from domain experts can also be utilized to elicit features by recommending proper expert stakeholders [6, 7]. In addition, descriptions of online software products can be leveraged to elicit software features [8, 9]. In particular, reviews can be viewed as a way for users to collaboratively propose feature requests for a certain mobile app, so many works have addressed extracting feature requests from app reviews. For example, an unsupervised information extraction system named OPINE is proposed in [10], which builds a model of important product features by mining reviews. Information retrieval techniques such as topic modeling are leveraged in [11] to extract topics and representative sentences of those topics from user comments, which are then used to revise requirements for the next releases of the software. A prototype named mobile app review analyzer (MARA) for the automatic retrieval of mobile app feature requests from app reviews is designed in [12]. As a whole, these approaches mainly leverage information retrieval techniques to identify feature requests from user reviews and do not classify app reviews in advance. An automated approach that helps developers filter, aggregate, and analyze user reviews is proposed in [13]; however, it mainly focuses on sentiment analysis of reviews, and feature request mining is not its focus. Several classification algorithms are compared in [14] to classify app reviews into four types: bug reports, feature requests, user experiences, and ratings. This work classifies app reviews comprehensively, but it does not consider characteristics, such as linguistic rules, that are specific to feature requests.

An approach for extracting feature requests from app reviews is presented in this paper. We focus on how to select appropriate classification attributes and an optimal classification algorithm to identify feature requests from app reviews. Specifically, various classification attributes such as bag of words, linguistic rules, and metadata (e.g., rating, tenses, and sentiment) are analyzed. Four classification algorithms, namely J48, Naive Bayes, Random Forest [15], and SVM (Support Vector Machine) [16], are compared to select an optimal classifier. Then LDA (Latent Dirichlet Allocation) [17] is used to cluster reviews on feature requests. Finally, based on the clustered topics and terms, phrases that represent feature requests are extracted using the Stanford Parser [18], a tool that generates the word dependencies of sentences.

The main contributions of our work are as follows.

  • An approach for extracting feature requests from app reviews is proposed. Various classification attributes that can be selected from raw reviews, as well as classification algorithms, are discussed. In addition, we use LDA to cluster reviews on feature requests into various groups, and word dependencies are used to extract phrases that represent feature requests based on the clustering result.

  • Experiments on a real-world data set are conducted to identify reviews on feature requests and extract representative feature requests.

The remainder of the paper is organized as follows. Section 2 introduces the feature request extraction approach in detail. The evaluation of the proposed approach is discussed in Sect. 3. Section 4 discusses related work, and we conclude the paper in Sect. 5.

2 Feature Requests Extraction

2.1 Overview of the Approach

An overview of the approach is described in Fig. 1, which mainly involves three steps:

Fig. 1. Overview of feature requests extraction

Step 1: The objective of this step is to identify which reviews belong to feature requests using classification techniques. As depicted in Fig. 1, a classifier is trained by selecting appropriate classification attributes from these reviews. Then the classifier is utilized to predict whether unlabeled reviews belong to feature requests. To improve the prediction performance, it is important to select appropriate classification algorithms together with attributes.

Step 2: After raw reviews are classified, reviews on feature requests are clustered into semantically similar groups. Clustering algorithms have been widely used for mining features in the field of feature model extraction [8, 9] and for discovering Web services from text descriptions [19, 20]. In this paper, LDA, a widely used topic model, is adopted to cluster these reviews based on identified latent topics.

Step 3: For each topic, the highly relevant reviews are selected, and then verb-noun phrases and noun phrases are extracted by analyzing the word dependencies of these selected reviews using the Stanford Parser. Finally, the phrases are filtered and selected as feature requests based on the relevance between their contained terms and the topic.

The extracted feature requests can be viewed as new requests for the mobile app, and therefore they can also serve as a basis for the mobile app's evolution and changes to its feature model, meaning that they will be developed or reused in the next release of the mobile app. Due to space limitations, in this paper we focus more on selecting appropriate classification algorithms and classification attributes.

2.2 Classification Attributes Selection

A review extracted from the Apple App Store or the Google Play store [21] usually consists of the following attributes: app Id, review Id, review title, review comment, rating, reviewer, fee, date, and data source. However, not all of these attributes are useful for training a classifier to identify feature requests. Therefore, it is necessary to select useful information from the review.

The title and comments are the basic attributes of an app review and can be treated as a document. In document classification, the bag of words (abbr. BW) provides the basic classification attributes. The vectorization process of BW usually proceeds as follows: first, a dictionary that includes all terms of the reviews in the corpus is created; next, whether and how often each term appears in a review is counted; finally, the TF-IDF of each term in a review is calculated. Natural language processing techniques such as stop word removal and lemmatization are usually applied during this process. In this paper, BW refers to the bag of words together with stop word removal and lemmatization.
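
As an illustration, the sketch below shows one way the BW attributes could be produced with scikit-learn and NLTK; the library choice and the sample reviews are our own assumptions rather than details of the original implementation.

```python
# A minimal sketch of the BW vectorization (stop-word removal, lemmatization,
# TF-IDF weighting). The choice of scikit-learn and NLTK is an assumption;
# the paper does not state which implementation was used.
# Requires the NLTK "punkt" and "wordnet" data packages.
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

lemmatizer = WordNetLemmatizer()

def lemma_tokenizer(text):
    # Tokenize a review title/comment and lemmatize each token.
    return [lemmatizer.lemmatize(tok.lower()) for tok in word_tokenize(text)]

reviews = [  # hypothetical review texts
    "Please add an option to upload pictures",
    "The app crashes every time I open a video",
]

# stop_words='english' removes common function words; each cell of the
# resulting matrix holds the TF-IDF weight of one term in one review.
vectorizer = TfidfVectorizer(tokenizer=lemma_tokenizer, stop_words="english")
bw_matrix = vectorizer.fit_transform(reviews)
print(bw_matrix.shape)
```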

According to the analysis of the manually identified feature requests in the random sample described in [12], some keywords used for defining linguistic rules on feature requests have been identified in the title or comments. In order to reflect linguistic rules using these keywords, they are classified into three categories: modal verbs (abbr. MV), general verbs or nouns (abbr. VN), and preposition phrases (abbr. PP), as shown in Table 1.

Table 1. Keywords in reviews on feature requests

The TF-IDF of each category is calculated to quantify the textual attributes. The TF of a category in a review is the ratio between the number of keywords of the category that occur in the review and the total number of words in the review. The IDF of a category is the logarithm of the ratio between the number of all reviews and the number of reviews containing any keyword of this category. The TF-IDF of a category is the product of the TF and IDF scores of the category. They are calculated using Eqs. (1), (2), and (3), respectively.

$$\begin{aligned} \ TF(c,r)=\frac{\sum _{k\in c}\#\,\,of\,k\,occurs\,in\,r}{\#\,of\,words\,in\,r}, \end{aligned}$$
(1)
$$\begin{aligned} IDF(c)=log\frac{\#\,of\,reviews}{\sum _{k\in c}\#\,of\,reviews\,containing\,k}, \end{aligned}$$
(2)
$$\begin{aligned} TF\text {-}IDF(c,r)=TF(c,r)\times IDF(c), \end{aligned}$$
(3)

where k represents a keyword in a category c (MV, VN, or PP) and r represents a review. In addition, we use LR to represent the combination of MV, VN, and PP.
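
To make Eqs. (1)-(3) concrete, the sketch below computes the category-level TF-IDF in plain Python. The keyword sets and reviews are small hypothetical samples (the actual keyword lists are given in Table 1), and multi-word PP keywords would additionally require phrase matching.

```python
import math

# Hypothetical single-word keyword samples per category (see Table 1 for the
# actual lists); multi-word PP keywords would need n-gram matching instead.
CATEGORIES = {
    "MV": {"should", "could", "would"},
    "VN": {"add", "need", "wish"},
    "PP": {"instead", "alongside"},
}

def tf(category, review_tokens):
    # Eq. (1): occurrences of the category's keywords divided by review length.
    hits = sum(1 for tok in review_tokens if tok in CATEGORIES[category])
    return hits / len(review_tokens) if review_tokens else 0.0

def idf(category, all_reviews):
    # Eq. (2), taken literally: log of the number of reviews over the summed
    # per-keyword review counts of the category.
    denom = sum(
        sum(1 for tokens in all_reviews if kw in tokens)
        for kw in CATEGORIES[category]
    )
    return math.log(len(all_reviews) / denom) if denom else 0.0

def tf_idf(category, review_tokens, all_reviews):
    # Eq. (3): product of the TF and IDF scores of the category.
    return tf(category, review_tokens) * idf(category, all_reviews)

reviews = [
    "you should add an option to upload pictures".split(),
    "i wish the developers would fix the crash".split(),
    "great app love the new design".split(),
]
print(tf_idf("VN", reviews[0], reviews))
```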

Metadata such as the star rating, the tenses of verbs, and the reviewer sentiment can be extracted from app reviews. The star rating is a numeric value between 1 and 5 given by the reviewer, which is used as a classification attribute. The tense of the verbs that occur in the review is also selected as a classification attribute, because the future tense indicates a higher likelihood of an enhancement of the app or a new feature request. Different from [14], which distinguishes past, present, and future tenses using the part-of-speech tagging provided by NLP libraries, we only distinguish the future tense and the non-future tense in this paper. The reviewer sentiment reflects the positive and negative emotions of the reviewer [13]. Thelwall et al. [22] propose a fine-grained sentiment extraction approach, in which one negative sentiment score on a scale of −5 to −1 and one positive score on a scale of 1 to 5 are assigned to each review. Similar to [13, 14], an absolute score combining the negative and positive scores is used as a classification attribute.
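
The sketch below illustrates how such metadata attributes could be assembled for one review. The future-tense heuristic and the way the two sentiment scores are combined are our own illustrative assumptions, and the sentiment scores themselves are assumed to come from a SentiStrength-style tool [22].

```python
# A sketch of the metadata attributes (rating, future vs. non-future tense,
# sentiment). The tense heuristic and the score combination are illustrative
# assumptions; the paper states which attributes are used, not how they are
# computed. Requires the NLTK "punkt" and "averaged_perceptron_tagger" data.
import nltk

def has_future_tense(text):
    # Treat a modal "will"/"'ll"/"wo(n't)" followed by a base-form verb as
    # future tense; everything else counts as non-future.
    tags = nltk.pos_tag(nltk.word_tokenize(text.lower()))
    for (word, tag), (_, next_tag) in zip(tags, tags[1:]):
        if tag == "MD" and word in {"will", "'ll", "wo"} and next_tag.startswith("VB"):
            return True
    return False

def metadata_features(review):
    # 'review' is a dict with hypothetical field names; the positive score is
    # on a 1..5 scale and the negative score on a -5..-1 scale, as in [22].
    return {
        "rating": review["rating"],                      # star rating, 1-5
        "future_tense": int(has_future_tense(review["comment"])),
        # one possible way to combine the two scores into a single value
        "sentiment": abs(review["pos_score"] + review["neg_score"]),
    }

example = {"rating": 4, "comment": "I will uninstall it unless you add folders",
           "pos_score": 2, "neg_score": -4}
print(metadata_features(example))
```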

2.3 Feature Requests Clustering and Extraction

As one of the most widely used topic models, LDA can be used to extract unobserved factors that capture the underlying domain semantics within the given documents. Once the reviews on feature requests are identified, LDA is leveraged to cluster these reviews on feature requests and identify the latent topics among them. More specifically, according to the distribution of topics in these reviews and the distribution of terms in topics generated by LDA, we can cluster the reviews where the highly relevant reviews on each topic are grouped together and we can also identify the highly relevant terms of each topic.
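
As an illustration of this step, the sketch below clusters a few hypothetical feature-request reviews with gensim's LDA implementation; the number of topics and the relevance threshold are illustrative choices rather than values from the paper.

```python
# A sketch of Step 2 using gensim's LDA; the topic count and the threshold are
# illustrative. The token lists stand for already preprocessed reviews that
# were classified as feature requests in Step 1.
from gensim import corpora, models

feature_request_reviews = [
    ["add", "option", "upload", "picture"],
    ["add", "dark", "mode", "setting"],
    ["need", "button", "download", "file"],
    ["add", "file", "download", "option"],
]

dictionary = corpora.Dictionary(feature_request_reviews)
corpus = [dictionary.doc2bow(tokens) for tokens in feature_request_reviews]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=20)

# Distribution of terms in topics: the highly relevant terms of each topic.
for topic_id in range(lda.num_topics):
    print("topic", topic_id, lda.show_topic(topic_id, topn=5))

# Distribution of topics in reviews: group a review under a topic when its
# probability for that topic exceeds a threshold.
for i, bow in enumerate(corpus):
    for topic_id, prob in lda.get_document_topics(bow):
        if prob > 0.5:
            print(f"review {i} -> topic {topic_id} ({prob:.2f})")
```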

In our opinion, feature requests can be represented in the form of verb-noun phrases, e.g., “update screen”, or noun phrases, e.g., “picture upload”. Next, we turn our attention to extracting this kind of feature request from the clustered reviews.

Inspired by our previous work on service goal extraction [23], in this paper we leverage the Stanford Parser [18] to extract the feature requests. The Stanford Parser can be used to perform linguistic analysis of the sentences contained in reviews. The main linguistic analysis result we use is word dependencies, which describe binary relations between words within a sentence. For example, amod(option-2, File-1) and compound(upload-6, picture-5) are two word dependencies in the review “File option such as picture upload will be loved”. Based on these two word dependencies, we can obtain two potential feature requests: file option and picture upload. The Stanford Parser provides about 50 word dependencies, of which we currently use a subset to extract feature requests, as shown in Table 2.
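
The following sketch illustrates how candidate phrases can be extracted from word dependencies. It uses the stanza package (the Stanford NLP group's Python library) as a stand-in for the Stanford Parser, and the relation set is only a small sample of the dependencies in Table 2.

```python
# A sketch of phrase extraction from word dependencies. stanza is used here as
# a stand-in for the Stanford Parser, and USED_RELATIONS is only a sample of
# the dependency subset listed in Table 2.
# Requires the English stanza models (stanza.download("en")).
import stanza

nlp = stanza.Pipeline(lang="en", processors="tokenize,pos,lemma,depparse")
USED_RELATIONS = {"amod", "compound", "obj", "dobj"}  # sample subset

def candidate_phrases(sentence):
    phrases = []
    for sent in nlp(sentence).sentences:
        for word in sent.words:
            if word.deprel in USED_RELATIONS and word.head > 0:
                head = sent.words[word.head - 1]
                # keep dependent and head in sentence order, e.g. "picture upload"
                first, second = sorted((word, head), key=lambda w: w.id)
                phrases.append(f"{first.text.lower()} {second.text.lower()}")
    return phrases

print(candidate_phrases("File option such as picture upload will be loved"))
# expected to include "file option" and "picture upload"
```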

Table 2. Used word dependencies

For each topic, we can extract feature requests from the reviews that are highly relevant to the topic using the above-mentioned approach. Note that not all the phrases extracted from the word dependencies will be appropriate feature requests relevant to the topic. A phrase is viewed as a candidate feature request relevant to a topic if and only if it contains at least one highly relevant term of that topic. These candidate feature requests can be ranked according to their frequencies in the topic and the probabilities of their contained terms over the topic. In this way, we obtain the feature requests from the reviews.
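
How the filtering and ranking could be realized is sketched below; the scoring function (frequency weighted by the summed term probabilities) is an illustrative choice, since the paper only states that frequencies and term probabilities are combined.

```python
from collections import Counter

def rank_feature_requests(phrases, topic_terms, top_n=10):
    """Filter and rank candidate phrases for one topic.

    phrases     -- candidate phrases extracted from reviews relevant to the topic
    topic_terms -- highly relevant terms of the topic mapped to their LDA
                   probabilities
    """
    counts = Counter(phrases)
    ranked = []
    for phrase, freq in counts.items():
        words = phrase.split()
        # keep the phrase only if it contains at least one highly relevant term
        if not any(w in topic_terms for w in words):
            continue
        # illustrative score: frequency weighted by summed term probabilities
        score = freq * sum(topic_terms.get(w, 0.0) for w in words)
        ranked.append((phrase, score))
    return sorted(ranked, key=lambda item: item[1], reverse=True)[:top_n]

# hypothetical topic terms and candidate phrases
topic_terms = {"picture": 0.12, "upload": 0.10, "file": 0.08}
phrases = ["picture upload", "picture upload", "file option", "easy way"]
print(rank_feature_requests(phrases, topic_terms))
```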

3 Evaluation

In this section, we evaluate the proposed extraction approach through a series of experiments. All experiments are conducted on a PC with a 3.19 GHz Intel Core i3 CPU and 4 GB RAM, running Windows 7.

3.1 Experiment Data

In the experiments, we used the data set provided in [14], which was extracted from the Apple App Store and Google Play. In this data set, each review has the attributes of comment text, title, app name, category, store, submission date, and username. Furthermore, metadata such as the star rating, the tenses of verbs, and the sentiment of the reviews were also extracted. Moreover, the types of the reviews were manually analyzed and labeled, which serves as the ground truth. Due to the great effort required for manual labeling, 1924 reviews were labeled in the data set, of which 295 reviews were feature requests, 600 reviews were non-feature requests, and 1029 reviews were labeled as other types such as bug reports, user experiences, and ratings.

3.2 Evaluation Indicator

In order to evaluate the performance of the various classification algorithms under different classification attributes, the standard metrics precision, recall, and F-measure are used. In this paper, precision is the fraction of reviews classified as feature requests that actually are feature requests. Recall is the fraction of reviews on feature requests that are classified correctly. F-measure is the harmonic mean of precision and recall. They are calculated by Eqs. (4), (5), and (6), respectively.

$$\begin{aligned} precision=TP/(TP+FP), \end{aligned}$$
(4)
$$\begin{aligned} recall=TP/(TP+FN), \end{aligned}$$
(5)
$$\begin{aligned} F\text {-}measure=\frac{2\times precision \times recall}{precision+recall}, \end{aligned}$$
(6)

where TP is the number of reviews that are classified as feature requests and actually are feature requests, FP is the number of reviews that are classified as feature requests but actually are not, and FN is the number of reviews that are classified as non-feature requests but actually belong to feature requests.

3.3 Results and Analysis

Firstly, a group of experiments is conducted to evaluate the performance of various classification algorithms under different classification attributes. The values of precision, recall, F-measure, and execution time are compared for the classification algorithms Naive Bayes, SVM, J48, and Random Forest under different classification attributes. In order to reduce sensitivity to the data split, the mean values of precision (abbr. pre), recall (abbr. rec), F-measure (abbr. F1), and execution time (abbr. time) were calculated over five runs with the same proportion of training and testing data. In each run, 75% of the reviews labeled as feature or non-feature requests were randomly selected as training data and the remaining 25% as testing data. The results of this group of experiments are shown in Table 3 (note that rat, ten, and sen denote rating, tenses, and sentiment, respectively).

Table 3. Comparison of various algorithms under different attributes
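
A sketch of this comparison protocol is given below. scikit-learn classifiers stand in for the algorithms named above (DecisionTreeClassifier approximates J48), and the random attribute matrix only marks where the BW, LR, and metadata attributes of Sect. 2.2 would be plugged in.

```python
# A sketch of the comparison protocol: five random 75%/25% splits, with
# precision, recall, and F1 averaged per classifier. scikit-learn analogues
# stand in for the algorithms named in the paper; X and y would be the
# attribute matrix and labels of Sect. 2.2, replaced here by placeholders.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import precision_recall_fscore_support
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

CLASSIFIERS = {
    "Naive Bayes": GaussianNB(),
    "SVM": SVC(),
    "J48": DecisionTreeClassifier(),
    "Random Forest": RandomForestClassifier(),
}

def compare(X, y, runs=5):
    for name, clf in CLASSIFIERS.items():
        scores = []
        for seed in range(runs):
            X_tr, X_te, y_tr, y_te = train_test_split(
                X, y, test_size=0.25, random_state=seed)
            clf.fit(X_tr, y_tr)
            p, r, f1, _ = precision_recall_fscore_support(
                y_te, clf.predict(X_te), average="binary", pos_label=1)
            scores.append((p, r, f1))
        print(name, np.round(np.mean(scores, axis=0), 3))

rng = np.random.default_rng(0)
X_demo = rng.random((200, 8))        # placeholder attribute matrix
y_demo = rng.integers(0, 2, 200)     # 1 = feature request, 0 = non-feature request
compare(X_demo, y_demo)
```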

As can be seen from Table 3, if BW is solely used as the classification attribute, the precision, recall, and F-measure of the classification algorithms range from 63.5% to 70.9%, 52.7% to 68.9%, and 60.5% to 69.4%, respectively. Among them, SVM achieves the best precision, but its recall is the lowest. Naive Bayes achieves the best classification results on the whole, and its execution time is the shortest. Random Forest is suboptimal on the whole, and its execution time is the longest. If LR is solely used as the classification attribute, the precision, recall, and F-measure of each classifier are about 57.7%, 60.8%, and 59.2%, respectively. It is obvious that the F-measure obtained with BW is generally superior to that obtained with LR, but the execution time is the opposite. The reason is that BW carries much more classification information but also has far more dimensions to compute when training a classifier. If BW and LR are used as classification attributes together, the precision, recall, and F-measure of each classifier range from 66.7% to 76.8%, 58.1% to 75.7%, and 65.6% to 75.2%, respectively. The four classifiers show a relative ranking similar to that obtained when only BW is used as the classification attribute. Each algorithm achieves better results when both are used, which shows that BW and LR complement each other in training a classifier.

Furthermore, adding metadata such as rating, tenses, and sentiment can effectively improve the performance of each classification algorithm. According to the results, using tenses yields better performance than using the other two, because the future tense is distinguished from the non-future tense. When LR and the metadata are used as classification attributes, J48 achieves a better F-measure than the other three classifiers, since J48 is more suitable for samples of smaller size. SVM often achieves the best precision, but its recall is almost the lowest, so its F-measure is not as good. The performance of Random Forest is rather moderate, and its execution time is the longest. The performance of Naive Bayes is superior to that of the other classifiers if BW is used as one of the classification attributes. When BW, LR, and metadata are all used as classification attributes, Naive Bayes achieves the best results, with both precision and recall reaching 82.4%. Additionally, the execution time of Naive Bayes is the shortest.

Afterwards, LDA is applied to the reviews on feature requests. Table 4 depicts some topics and the top five representative terms of each topic. Each column represents a topic, and each row shows terms and their probabilities in the corresponding topics.

Table 4. Some topics and representative terms

Finally, for each topic, phrases are extracted from the clustered reviews by using the Stanford Parser. The top-ranked feature requests in the data set are as follows: update screen, fix game, game phone, add button, file download, open video, file option, picture upload, easy way, and proper version. Clearly, most of the identified feature requests are meaningful and can help requirements analysts identify new evolution requirements from app reviews. On the other hand, some resulting feature requests, such as easy way and proper version, are not satisfactory. Further improving the identified feature requests by using more word dependencies and more filtering rules will be part of our future work.

3.4 Threats to Validity

With respect to internal validity, the main threat is that the proposed approach mainly considers various classification algorithms under different classification attributes for identifying whether a review belongs to feature requests, whereas it is also important to select proper clustering algorithms for grouping similar reviews into feature requests.

Threats to external validity concern the selection of keywords that occur in reviews on feature requests. Since only some familiar keywords are selected for training a classifier, the value of recall is not high. Additionally, the scale of the experimental data needs to be extended. Due to the difficulty of obtaining the ground truth of review types through manual labeling, we only selected 1924 reviews for the experiments, which inevitably limits the verification of the performance of the classification algorithms.

4 Related Work

Many works have been presented to extract feature requests. Laurent et al. [5] explore the use of online forums to conduct requirements engineering tasks for open source projects led by a software vendor. Castro-Herrera et al. [6] present a hybrid recommender system to identify potential users who might be capable of responding to unanswered posts in open source forums. Castro-Herrera et al. [7] utilize the organizer and promoter of collaborative ideas (OPCI) recommendation system to recommend expert stakeholders in the field, and the requirements are elicited by means of the domain knowledge of these expert stakeholders. These approaches focus on finding the proper stakeholders to participate in the process of requirements elicitation. Hariri et al. [8] and Dumitru et al. [9] leverage data mining techniques to extract common features from online product descriptions and design a recommender system to elicit missing features.

App reviews have also been used to extract feature requests. Popescu et al. [10] introduce an unsupervised information extraction system named OPINE to build a model of important product features. Galvis Carreño et al. [11] adapt topic modeling to extract the topics mentioned in user comments and sentences representative of those topics. These approaches do not consider user sentiment, and they do not address the problem of mis-classification or the mixing of topics. Iacob et al. [12] design a prototype named mobile app review analyzer (MARA) to automatically retrieve mobile app feature requests from app reviews, where they manually define linguistic rules and identify feature requests from reviews that match at least one linguistic rule. Because only part of the linguistic rules are listed in their paper, it is difficult to quantitatively compare their approach with ours. Their approach requires considerable manual labor, and its output consists of the keywords relevant to the feature requests. In our approach, we use linguistic rules and bag of words as classification attributes to train a classifier for identifying reviews on feature requests. We also use the Stanford Parser to extract phrases that represent feature requests, which can be easily understood by users. Guzman et al. [13] propose an automated approach that helps developers filter, aggregate, and analyze user reviews. However, they mainly focus on sentiment analysis of reviews, and feature request mining is not their research task. Maalej et al. [14] introduce several probabilistic techniques such as string matching, text classification, NLP (Natural Language Processing), and sentiment analysis, and compare classification algorithms including Naive Bayes, Decision Tree, and maximum entropy (MaxEnt) to classify app reviews into four types: bug reports, feature requests, user experiences, and ratings. They classify app reviews comprehensively, but they do not consider characteristics, such as linguistic rules, that are specific to feature requests. In contrast to their works, we add linguistic rules in addition to bag of words and metadata as classification attributes to identify reviews on feature requests. Furthermore, we adopt the Stanford Parser to extract representative phrases as feature requests based on the clustered topics and the top relevant terms of each topic.

5 Conclusion and Future Work

An approach for extracting feature requests from app reviews is proposed in this paper. The approach can be applied in the early stage of software requirements engineering and can serve as a supplement to mining features from the descriptions of mobile apps. In order to accurately extract feature requests from reviews, it is important to identify whether a review belongs to feature requests. Therefore, different classification algorithms are compared under various classification attributes and validated on a real-world data set of reviews from the Apple App Store and Google Play. In addition, phrases that represent feature requests are extracted using word dependencies based on the clustered reviews.

In the future, we plan to extend our work from the following directions. Firstly, we will further investigate how to extract more meaningful phrases as feature requests from reviews. Secondly, we plan to evaluate the performance of various classification algorithms under different classification attributes when the experimental data set grows to a larger scale.