1 Introduction

The rapid growth of the mobile industry in the past decade has led to a paradigm shift in the way software is being produced and consumed (Petsas et al. 2013). As mobile technology becomes more accessible, more consumers are migrating to their smartphones and tablets to handle their day-to-day computing activities. In response to this growth, mobile application (app) stores (e.g., Google Play and the Apple App Store) have emerged as a new model of online distribution platforms (Basole and Karla 2012). These stores have expanded in size in the past five years to host millions of apps, offering end-users of software virtually unlimited options to choose from. For instance, as of March 2017, the Apple App Store alone reported around 2.2 million active apps, growing by over 1,000 apps per day.

Similar to conventional online markets (e.g., Amazon and eBay), mobile app stores enable their customers to share their app experience in the form of textual reviews and star ratings. This unique channel of user feedback has created an unprecedented opportunity for app developers to directly monitor the opinions of large and heterogeneous populations of end-users of their software (Pagano and Maalej 2013). In fact, analyzing large datasets of app store reviews has revealed that they contain a substantial amount of up-to-date technical information. Such information can be leveraged by app developers to help them maintain and sustain their apps in a highly-competitive and volatile market (Pagano and Maalej 2013). An underlying tenet is that user involvement in the software production process is a major contributing factor to software success (Bano and Zowghi 2015).

Realizing the technical and market value of app store feedback, a plethora of methods and tools have been proposed in the literature to automatically capture and categorize informative user reviews (Carreño and Winbladh 2013; Chen et al. 2014; Maalej and Nabil 2015; Pagano and Maalej 2013; Villarroel et al. 2016). In general, app store mining techniques rely on the textual attributes of user reviews to classify them into fine-grained software maintenance requests, including feature requests and bug reports. Such techniques range from detecting the presence and absence of certain indicator terms (e.g., “crash”, “bug”), to more computationally expensive methods that rely on text modeling and classification techniques (Carreño and Winbladh 2013; Guzman and Maalej 2014; Maalej and Nabil 2015; Panichella et al. 2015). However, while these techniques have shown decent accuracy levels, they typically suffer from several drawbacks. For instance, users tend to express their reviews using informal language, including colloquial terminologies and other neologisms. Such a wide spectrum of words often results in complex text classification models, which in turn might lead to overfitting problems. In particular, due to the rapid manner in which natural language evolves online, a classifier trained using a vocabulary collected at a certain point in time might not be able to accurately generalize for unseen-before reviews (McCallum and Nigam 1998).

To address these challenges, in this paper, we propose a novel semantically aware approach for mining and classifying user reviews on mobile app stores. Our approach is based on the notion of semantic role labeling (SRL). The primary assumption behind SRL is that words can be grouped into semantic classes, called frames. A semantic frame describes an event that occurs in a sentence along with its participants (e.g., people and objects). The goal is to capture the meaning of a sentence at a higher level of abstraction. More specifically, by annotating words and phrases in the text with various frame elements (or roles), we can generalize from specific sentences to scenarios. Such annotations can be generated using the FrameNet project (Baker et al. 1998). FrameNet provides an online lexical repository of semantic frames and their roles.

SRL and frame semantics have been successfully used in various text mining tasks, such as predicting the stock market movement by analyzing the textual content of financial news articles (Xie et al. 2013), extracting social networks from unstructured text (Agarwal et al. 2014), question answering tasks (Shen and Lapata 2007), and stance classification in political debates (Hasan and Ng 2013). Following this line of research, in this paper, we investigate the performance of frame semantics in supporting basic app store review mining algorithms. Our objective is to describe a series of light-weight and accurate algorithms for identifying, classifying, and summarizing informative user reviews into different groups of actionable software maintenance requests. Our analysis is conducted using a dataset of app reviews that is sampled from a broad range of application domains (Chen et al. 2014; Jha and Mahmoud 2017b; Maalej and Nabil 2015). Specifically, the work presented in this paper extends our previous work in Jha and Mahmoud (2017b) as follows:

  • Review summarization: Popular mobile apps, hosted at multiple app stores, often generate thousands of informative reviews. Presenting such a large, and typically redundant, amount of raw user reviews to developers can cause confusion. This emphasizes the need for automated methods to identify and summarize the most pressing issues in the reviews to enable a more effective data exploration process. To address this need, we evaluate the performance of various text summarization algorithms in the context of mobile app reviews. Our set of algorithms consists of Hybrid TF and Hybrid TF.IDF (Inouye and Kalita 2011), SumBasic (Nenkova and Vanderwende 2005), and LexRank (Erkan and Radev 2004). These algorithms (except for LexRank) were used in our previous work to summarize software-relevant tweets available on the Twitter feeds of several popular software systems (Williams and Mahmoud 2017).

  • Tool support: In Jha and Mahmoud (2017a) we introduced MARC, a Mobile App Review Classifier that implemented our algorithms in Jha and Mahmoud (2017b). In this extension, we introduce MARC 2.0, a new release of MARC that is equipped with an enhanced GUI, a summarization engine, and a more accurate and efficient classification engine.

  • Enhanced text and discussion: The content of the paper is significantly enhanced by adding more thorough discussions of our proposed methods, experimental analysis, related work, and threats to validity.

The remainder of this paper is organized as follows. Section 2 introduces the FrameNet project and the notion of semantic frames. Section 3 describes our review classification process. Section 4 describes and evaluates our review summarization algorithms. Section 5 presents MARC 2.0. Section 6 reviews seminal work related to our work in this paper. Section 7 discusses the main threats to validity. Finally, Section 8 concludes the paper and discusses prospects of future work.

2 Frame Semantics

Housed and maintained by the International Computer Science Institute in Berkeley, California, the FrameNet project (Baker et al. 1998) provides a massive machine-readable database of manually annotated sentences based on the theory of Frame Semantics (Fillmore 1976). This theory states that the meanings of lexical items (predicates) are best defined with respect to larger conceptual chunks, called Frames. Technically, the FrameNet project works to identify significant frames in sentences, their frame elements, and lexical units. A semantic frame (or simply frame) can be described as a schematic representation of a situation (events, actions) involving various elements. A frame element (FE) can be defined as a participant entity or a semantic role in the action described by the frame. Lexical units (LU) are the words that evoke the different frame elements. For instance, the frame commerce_buy describes a basic commercial transaction involving a buyer and a seller exchanging money and goods. This frame has the core frame elements buyer (evoked by lexical units such as buy) and goods. A core FE is an element that is necessary for the frame to occur. The commerce_buy frame also has several non-core FEs, such as place, purpose, seller, and time.

Figure 1 shows the semantic annotation of the sentence “John bought a car from Kristina in June.” under the semantic frame commerce_buy. This sentence contains the frame elements buyer, goods, seller, and time, evoked by the lexical units John, car, Kristina, and June respectively. Another example is the sentence “He traveled to Germany to buy a car”, shown in Table 1. This sentence is annotated under the semantic frames Travel, Commerce_buy, and Vehicle. The Travel semantic frame has the elements traveler and goal, evoked by the words he and to Germany respectively. The Commerce_buy frame has the elements buyer and goods, evoked by the words he and car respectively, and the frame Vehicle has the element vehicle, evoked by the word car.

Fig. 1 The semantic annotation of the sentence “John bought a car from Kristina in June.”

Table 1 A color-coded tabular representation of the semantic annotation of the sentence “He traveled to Germany to buy a car”

This unique form of semantic annotation represents an invaluable source of knowledge that can be exploited to support several text processing tasks. For example, the FrameNet database has been used in tasks such as semantic classification of text (Fleischman et al. 2003), question answering (Sinha 2008) and information extraction (Moschitti et al. 2003). Following this line of research, we utilize the FrameNet project to tackle the problem of mining user reviews in app stores. Our expectation is that FrameNet tagging will enable a deep understanding of the meaning of individual user reviews. This in turn should help to generate more accurate app review mining algorithms. Consider, for example, the sentence “I can’t see the pictures fix it please!!” extracted from a review of the photo-sharing app Imgur. Tagging this sentence using FrameNet results in the following frames:

[Semantic frame annotation of the review sentence; the key frame evoked is Predicament]

The key semantic frame in this example is Predicament, which refers to a situation where “An Experiencer is in an undesirable Situation, whose Cause may also be expressed”. This frame can also be evoked by other words such as problem, trouble, and jam. In general, any situation of inconvenience might evoke this frame. From a classification point of view, this frame represents a feature that can be used to predict bug reports.

Another example is the two review sentences “I wish you could add a functionality to use this app with any POP3 mailboxes.” and “I wanted to be able to use Gmail with all POP3 mailboxes.” extracted from two different reviews of the Gmail app. Both sentences convey the same message, describing a feature request to support all POP3 mailboxes, but with different terminologies. Tagging these two sentences using FrameNet generates the following representations:

[Semantic frame annotations of the two review sentences; the frames evoked by each are listed below]

In the first sentence, the words wish, could, add, use, and any evoke the frames Desiring, Capability, Statement, Using, and Quantity respectively. In the second sentence, the words wanted, able, use, and all evoke the frames Desiring, Capability, Using, and Quantity respectively. This example shows how similar frames are evoked by different words that share the same meaning in a specific context. For instance, in the above two sentences, the words wish and wanted are two different words that share the same meaning in the given context, and therefore, they both evoke the frame Desiring. Similarly, the words could and able evoke the semantic frame Capability in both sentences.

This form of semantic abstraction is expected to enhance the predictive capabilities of text classifiers. In particular, in text classification tasks, each individual word of the text is treated as a separate classification feature, such that the input text is represented as an unordered vector of its words. This approach, known as Bag-of-Words (BOW) classification, relies on the presence or absence of certain indicator terms in the text to make a decision. For instance, in the context of app review classification, words such as {bug, crash, fix, problem, issue, defect, solve, trouble} tend to be associated with bug reporting reviews, while words such as {add, please, would, hope, improve, miss, need, prefer, suggest, want, wish} are typically associated with feature requests or user requirements (Maalej and Nabil 2015). Such words are used by text classifiers to classify the input text under a certain label. In contrast, the approach we present in this paper can be described as a Bag-of-Frames (BOF) approach. In particular, the frames generated for each review, rather than its words, are used as classification features to represent the text (i.e., a vector of frames). Our assumption is that the BOF representation of the data will generate lower dimensional, and thus potentially more accurate, models. In what follows, we examine the impact of using semantic frames on two basic review mining tasks: review classification and summarization.
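To make the contrast between the two representations concrete, the following minimal Python sketch builds both feature sets for one of the Gmail review sentences discussed above. The frame list is the annotation shown earlier for this sentence (in practice it would be produced by a semantic parser), and the tokenization is deliberately simplified.

```python
from collections import Counter

review = "I wish you could add a functionality to use this app with any POP3 mailboxes."

# Bag-of-Words: every token of the review becomes a classification feature.
bow = Counter(review.lower().rstrip(".!?").split())

# Bag-of-Frames: the semantic frames evoked by the review become the features.
# The frame list below is the annotation shown earlier for this sentence; in
# practice it would be produced by a semantic parser such as Semafor.
frames = ["Desiring", "Capability", "Statement", "Using", "Quantity"]
bof = Counter(frames)

print(len(bow), "word features vs.", len(bof), "frame features")
```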

3 App Review Classification

Under this phase of our analysis, we examine the impact of using frame semantics on the accuracy of text classifiers that are commonly used in app review classification tasks. In what follows, we describe our experimental setup, including the dataset used to conduct our analysis, the classification algorithms used to classify the data, and the measures used to assess the performance of these algorithms under different classification configurations.

3.1 Experimental Dataset

Our ground-truth dataset of app reviews is compiled from three different datasets. Around 25% of our reviews were randomly sampled from the data collected by Maalej and Nabil (2015) and 50% were sampled from Chen et al.’s dataset (Chen et al. 2014). The remaining 25% of reviews were collected locally from the iOS apps CreditKarma, FitBit, and Gmail. Using such diverse data enhances the internal and external validity of our results by reducing any potential sampling bias, a problem that is commonly known as the app sampling problem (Martin et al. 2015).

For our local dataset, the most recent user reviews of each app were extracted using the RSS feed generator of the iOS app store. These reviews were manually classified by the authors and an external judge into feature requests, bug reports, and otherwise. Majority voting was used to determine the final class of each review. Furthermore, the data sampled from Maalej and Nabil (2015) and Chen et al. (2014) were re-examined by the researchers to ensure that their classification was consistent with our classification scheme. For example, the review “Just un install and reinstall Works Awesome now Love this app probably best ever!” from Maalej and Nabil (2015) was classified as a bug report based on its title (“Crash and will not open FIX”). In our analysis, we did not consider the titles of the reviews. Therefore, the classification of this review was changed to uninformative (i.e., otherwise). In total, our classification disagreed with the original classification of the external datasets in less than 3% of the total number of reviews.

In a few cases, some reviews were labeled differently by each judge and further discussion among the judges did not lead to a clear-cut label. For instance, the review “love the game a little hard to play on a not-so-fast wifi” was classified as a bug report by one judge, a feature request by the second judge, and otherwise by the third judge. A discussion among the judges did not lead to an agreement on the final label, thus the review was removed. In total, 13 instances were discarded from our dataset. Table 2 summarizes the characteristics of our dataset, including the number of bug reports, feature requests, and otherwise instances collected from each source.

Table 2 The collection of datasets used in our analysis

3.2 Classifiers

To classify our data, we use Support Vector Machines (SVM) and Naive Bayes (NB). Both algorithms are commonly used in text classification research (Cai and Hofmann 2004; Dumais and Chen 2000; Joachims 1998; Kim et al. 2006; Sebastiani 2002), and have been reported to outperform other classifiers in short-text classification tasks (e.g., Twitter data (Guzman et al. 2016; Williams and Mahmoud 2017), YouTube comments (Poché et al. 2017), and app user reviews (Guzman et al. 2015; Maalej and Nabil 2015; Panichella et al. 2015; Wang and Manning 2012)). In what follows, we describe these algorithms in greater detail:

  • Support Vector Machines (SVM): SVM is a supervised machine learning algorithm that is used to recognize patterns in multidimensional data spaces (Burges 1998). SVM tries to find optimal hyperplanes for linearly separable patterns in the data and then maximizes the margin around the separating hyperplane. Technically, support vectors are the critical elements of the training set that would change the position of the dividing hyperplane if removed. SVM classifies the data by mapping input vectors into an N-dimensional space and deciding on which side of the defined hyperplane a point lies. Support vector classifiers have been empirically shown to be effective in high dimensional and sparse text classification tasks (Joachims 1998).

  • Naive Bayes (NB): NB is a simple, yet efficient, linear probabilistic classifier that is based on Bayes’ theorem (Langley et al. 1992). NB relies on the conditional independence assumption, which implies that the attribute values of the data are independent of each other given the class. In the context of text classification, the features of the model are the individual words of the text artifacts. Such data are typically represented using a two-dimensional word × document matrix. The entry (i, j) in the matrix can be either a binary value indicating whether the document di contains the word wj (i.e., {0, 1}), or the relative frequency of the word wj appearing in the document di (McCallum and Nigam 1998).

3.3 Implementation and Classification Settings

In our analysis we use Weka, a data mining tool that implements a wide variety of machine learning and classification techniques. SVM is invoked through Weka’s SMO, which implements John Platt’s sequential minimal optimization algorithm for training a support vector classifier (Platt 1998). To evaluate our classifiers, we use 10-fold cross validation. This method of evaluation creates 10 partitions of the dataset such that each partition has 90% of the instances as a training set and the remaining 10% as an evaluation set. The evaluation sets are chosen such that their union is the entire dataset. The benefit of this technique is that the results exhibit significantly less variance than those of simpler techniques, such as the holdout method (i.e., using 70% of the data for training and 30% for testing) (Kohavi 1995).
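As an illustration only, a comparable 10-fold cross validation can be set up in Python with scikit-learn as sketched below. Our experiments use Weka's SMO and its kernel configurations, so the linear SVM and default Multinomial NB shown here are stand-ins rather than a replication of that setup.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

def evaluate(reviews, labels):
    """10-fold cross validation of SVM and NB over raw review strings."""
    for name, clf in [("SVM", SVC(kernel="linear")), ("NB", MultinomialNB())]:
        pipeline = make_pipeline(CountVectorizer(), clf)
        scores = cross_val_score(pipeline, reviews, labels, cv=10, scoring="f1_macro")
        print(f"{name}: mean F1 = {scores.mean():.2f}")
```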

To generate the BOF representation of our data (i.e., annotate the review sentences), we use Semafor, a probabilistic frame semantic parser (Das et al. 2010). Semafor automatically processes English sentences and generates their semantic annotations in a special XML format. We created a special-purpose parser to extract the semantic frames of each annotated sentence from the XML output.
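The parsing step itself is straightforward, as the Python sketch below illustrates. Note that the element and attribute names used here are assumptions about Semafor's XML output rather than its documented schema.

```python
import xml.etree.ElementTree as ET

def extract_frames(xml_path):
    """Collect the frame names assigned to each annotated sentence.

    The tag and attribute names used here ("sentence", "annotationSet",
    "frameName") are illustrative guesses at the parser's XML schema and
    should be checked against Semafor's actual output.
    """
    tree = ET.parse(xml_path)
    frames_per_sentence = []
    for sentence in tree.iter("sentence"):
        frames = [a.get("frameName")
                  for a in sentence.iter("annotationSet")
                  if a.get("frameName")]
        frames_per_sentence.append(frames)
    return frames_per_sentence
```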

For the BOW analysis, we used the IteratedLovinsStemmer provided in Weka to stem the reviews in our dataset (Lovins 1968). Stemming reduces words to their morphological roots. This leads to a reduction in the number of features (words) as only one base form of the word is considered. Most common words (words that appear in all reviews) along with the words that appear in only one review were removed from the data. These words are highly unlikely to carry any generalizable information. English stop-words were not removed from our data. This decision was based on the previous observation that some of these words (e.g., would, should, will) can carry important distinctive information for feature request reviews (Maalej and Nabil 2015; Panichella et al. 2015). For instance, several of these requests start with phrases such as “would you”, “could you please”, or “why don’t you”. Therefore, removing such words might lead to a decline in the predictive capabilities of the classifier.

Furthermore, in our analysis, we use Multinomial NB, which uses the normalized frequency (TF) of words in their documents (McCallum and Nigam 1998). Multinomial NB is known to be a more robust text classifier, consistently outperforming the binary feature model (Multi-variate Bernoulli) in highly diverse real-world corpora (McCallum and Nigam 1998).

3.4 Evaluation

Recall, precision, and the F-score are used to evaluate the performance of the different classification techniques used in our analysis. Recall is a measure of coverage. It represents the ratio of correctly classified instances under a specific label to the number of instances in the data space that actually belong to that label. Precision, on the other hand, is a measure of accuracy. It represents the ratio of correctly classified instances under a specific label to the total number of classified instances under that label. Formally, if A is the set of data instances in the data space which belong to the label λ, and B is the set of data instances that were assigned by the classifier to that label, then the recall (R) and the precision (P) can be calculated as:

$$ R_{\lambda} = \frac{|A \cap B|}{|A|} $$
(1)
$$ P_{\lambda} = \frac{|A \cap B|}{|B|} $$
(2)

We also use the Fβ score to report our results. This measure represents the harmonic mean of recall and precision, such that:

$$ F_{\beta} = \frac{(\beta^{2} + 1) PR}{(\beta^{2}P + R)} $$
(3)

Different values for β can be used depending on the preference of precision versus recall (Berry 2017; Powers 2014). For instance, in tasks such as requirements traceability and bug localization (Huffman-Hayes et al. 2006; Khatiwada et al. 2018), errors of omission (false negatives) are harder to deal with than errors of commission (false positives). In such tasks, the F2 score, which emphasizes recall over precision, is typically used. In our analysis, we use F1 (β = 1) since we assume that both types of retrieval errors (omission and commission) have the same impact on effort saving. Our assumption is based on the fact that automated support is needed whenever the number of reviews is relatively large (up to thousands of reviews). Therefore, a low precision would force users to wade through many uninformative reviews to find the correct answers that are buried in the output. On the other hand, a low recall would force users to manually examine an even larger number of reviews to look for concerns that were not retrieved at all.
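As a concrete illustration, treating A and B as sets of review identifiers for a given label, these measures reduce to a few lines of set arithmetic (a minimal sketch):

```python
def recall_precision_f(relevant, retrieved, beta=1.0):
    """relevant (A) and retrieved (B) are sets of review ids for one label."""
    true_positives = len(relevant & retrieved)
    recall = true_positives / len(relevant) if relevant else 0.0
    precision = true_positives / len(retrieved) if retrieved else 0.0
    if precision + recall == 0:
        return recall, precision, 0.0
    f_score = (beta**2 + 1) * precision * recall / (beta**2 * precision + recall)
    return recall, precision, f_score
```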

3.5 Results and Discussion

The results of our classification process are shown in Table 3. Using the BOF representation, SVM managed to outperform NB, achieving Fbugs = 0.86 and Ffeat. = 0.74, while NB achieved Fbugs = 0.81 and Ffeat. = 0.70. A similar behavior was observed under the BOW representation; SVM managed to achieve Fbugs = 0.85 and Ffeat. = 0.75, in comparison to NB which achieved Fbugs = 0.79 and Ffeat. = 0.72. In general, SVM outperforms NB, achieving almost the same performance under the two different representations of the data. The variation in precision under the different settings can be attributed to the precision-recall trade-off (a higher recall often leads to a larger number of false positives). The relatively better performance of SVM in comparison to NB can be attributed to its overfitting avoidance tendency—an inherent behavior of margin maximization which does not depend on the number of features (Brusilovsky et al. 2007). Therefore, it has the potential to scale up to high-dimensional data spaces with sparse instances, given that the right kernel is selected (Joachims 1998). Choosing a proper kernel function can significantly affect the generalization and predictive capabilities of SVM (Steinwart 2001). In our analysis, the best performance of the SVM+BOW classifier was achieved using the Normalized Poly Kernel, while the SVM+BOF classifier hit a maximum using the Pearson VII function-based universal kernel (Puk) with σ = 8 and ω = 1 (Üstün et al. 2006).

Table 3 The performance of NB and SVM using the BOF and the BOW representations of the data in Table 2

To assess the generalization capabilities of our classifiers, we tested their performance on an external set of reviews that was sampled from apps that were not included in our original dataset (Google Chrome, Facebook, and Google Maps). We sampled different numbers of reviews from each app to enhance the diversity of the data. Similar to the reviews in the original dataset (Table 2), the newly sampled reviews were classified manually by the authors and an external judge (see Section 3.1). Table 4 describes the final test dataset. Our main objective is to test the ability of the generated classifiers to generalize over unseen-before data, in other words, to test for overfitting. In automated classification, overfitting refers to a phenomenon where the classifier learns separate data instances (i.e., models the training data) rather than learning general categories. Formally, the model M overfits the data if there exists some other model M’ such that M has a smaller error over the training data than M’, but M’ has a smaller error than M over the entire distribution (Mitchell 1997).

Table 4 A test set of app reviews

To test for overfitting, the original models generated using the data in Table 2 were saved, reloaded, and re-evaluated over the test set. The performance of our different classifiers on the external test set is shown in Table 5. The results show that the BOF classifiers managed to outperform the classifiers generated using the BOW representation. More specifically, BOF+SVM achieved Fbugs = 0.96 and Ffeat. = 0.75. In contrast, the performance of the BOW classifiers has drastically dropped over the set of feature requests in the test set to Ffeat. = 0.54 for SVM and Ffeat. = 0.39 for NB, failing to match the performance achieved on the original dataset.

Table 5 The performance of the different classifiers over the test set (Table 4)

In general, the results over the test dataset suggest that NB and SVM trained under the BOW representation of the data suffered from overfitting. This behavior can be attributed to the fact that the feature space (number of words) is typically very large (Joachims 1998). A larger number of features causes the vector representation (BOW) of reviews to be very sparse (only very few entries with non-zero weights). This in turn forces the classifier to learn specific data instances rather than the general classification categories. The BOF representation, on the other hand, seems to overcome this problem by raising the level of abstraction from specific words to more abstract semantic representations. Reducing the number of features that the classifier needs to consider decreases the chances of overfitting and leads to better generalization over unseen-before data instances. For example, Table 6 shows the most popular frames in our original dataset (Table 2). The BOW training dataset did not contain the word desire. As a result, the feature request “another window is highly desired” in our test set was incorrectly classified as uninformative (i.e., otherwise) under the BOW representation. However, under the BOF representation, this review was correctly classified as a feature request since the word desire evoked the frame Desiring, which is one of the most distinctive frames of feature request reviews.

Table 6 The most popular frames in our original dataset (Table 2) and their evoking words

A smaller number of features not only reduces the chances of overfitting, but also speeds up the training process by reducing the computational requirements of the classifier. In our analysis, the BOF representation required 10 s to build the classifier and 96 s for evaluation using the 10-fold evaluation strategy, while the BOW representation required 32 s to build the classifier and 293 s for evaluation. This can be explained based on the fact that only 552 unique frames were used to build the BOF model, while the BOW model was built using 1592 unique words. On average, the BOF representation of the data saves up to 60% of the space and time requirements needed to build a model using the BOW representation.

It is important to point out that the semantic frames approach requires downloading the FrameNet database locally. This database occupies around 500 megabytes of space. This could be avoided by using the online semantic parser SEMAFOR. However, the online service requires more time to generate the semantic representations of the reviews. In particular, each review needs a separate Web request, and the returned Web page has to be parsed to extract the semantic frame representation of the text. Figure 2 shows the time required to extract the semantic representations of 10 reviews of length 3, 6, and 9 frames. The time is measured over 5 runs to ensure the accuracy of the readings. This analysis was executed on a 2.80 GHz CPU with 16.0 GB of RAM at 50 Mbps internet speed.

Fig. 2 The time required to generate the semantic representations of different length reviews (3, 6, and 9 frames) using the online SEMAFOR parser, measured over 5 runs

4 Review Summarization

In the first phase of our analysis, we were able to isolate useful user reviews with a high level of accuracy. However, presenting such a large, and typically redundant, amount of raw reviews to developers can cause confusion. This emphasizes the need for automated methods to identify and summarize the most pressing issues in the technically informative reviews to facilitate a more effective data exploration process (Sorbo et al. 2016). A summary can be described as a short and compact description that outlines the main themes present in a text collection (Khabiri et al. 2011; Llewellyn et al. 2014). The objective of review summarization is to assimilate the concerns of a large number of users into a few main topics.

4.1 Automatic Summarization

The main task under this phase of our analysis can be described as a multi-document summarization problem, where each user review is treated as a separate document. Multi-document summarization techniques can be either extractive or abstractive. Abstractive methods involve generating novel concise sentences, with a proper English narrative, describing the overall content of the text collection. Extractive methods, on the other hand, select specific sentences, or keywords, already present in the text as representatives of the entire text collection.


Abstractive methods often include heavy lexical parsing and reasoning to paraphrase novel sentences around extracted information (Hahn and Mani 2000). Therefore, they are known to work for semantically rich and grammatically sound corpora with high controversiality, such as scientific documents and news articles (Barzilay et al. 1999; Cheung 2008). However, from a linguistic point of view, user reviews on application stores can be described as pieces of short text. Short text is a type of text that has emerged recently in Natural Language Processing (NLP) research as a result of the explosive growth of micro-blogs on social media (e.g., tweets and YouTube and Facebook comments) and the urgent need for effective methods to analyze such large amounts of limited textual data. Such texts are known to be lexically and semantically restricted, and typically contain colloquial terms (e.g., LOL, smh, idk) along with phonetic spellings and other neologisms (Squires 2010). For this type of text, extractive methods have been found to be more effective in generating concise summaries (Nichols et al. 2012).

The majority of extractive text summarization algorithms rely on the frequencies of words as an indication of their perceived importance (Hahn and Mani 2000). Specifically, the likelihood of words appearing in a human-generated summary is positively correlated with their frequency (Khabiri et al. 2011). Formally, an extractive summarization process can be described as follows: given a topic, or a phrase, M in a list of user reviews R, and assuming the desired summary length is K, generate a set of representative reviews \(R^{\prime}\) with a cardinality of K such that \(\forall r_{i} \in R^{\prime}, M \in r_{i}\) and \(\forall r_{i}, \forall r_{j} \in R^{\prime}, r_{i} \nsim r_{j}\). The condition \(r_{i} \nsim r_{j}\) is enforced to ensure that the reviews selected for inclusion in the summary provide sufficiently different information (i.e., are not redundant) (Inouye and Kalita 2011).

Extractive summaries can take the form of a word cloud. A word cloud can be described as a visual representation of textual data in which important words are written (visualized) in a larger font size. The importance of a word in the cloud can be simply correlated with its frequency in the text. Figure 3 shows a word cloud generated for a set of reviews sampled from the Alexa app. The cloud shows the 30 most frequent words in the reviews after removing English stop-words.
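Producing the input to such a cloud amounts to a simple frequency count over the reviews after stop-word removal. The following minimal Python sketch illustrates the idea; the stop-word list is a placeholder.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "to", "and", "is", "it", "i", "this"}  # placeholder list

def top_words(reviews, k=30):
    """Return the k most frequent non-stop-words across all reviews."""
    counts = Counter(
        word
        for review in reviews
        for word in review.lower().split()
        if word not in STOP_WORDS
    )
    return counts.most_common(k)  # (word, frequency) pairs; frequency drives font size
```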

Fig. 3 Examples of user concerns raised in the reviews of the Alexa app, summarized using full review summaries and a word cloud

While word clouds can capture the main concerns in the reviews, due to the lack of context, it is often unclear what these concerns actually are. In contrast, full-sentence extractive summaries have the advantage of preserving the context (Barker et al. 2016; Khabiri et al. 2011). For instance, Fig. 3 shows sample reviews related to two main issues raised in the set of Alexa’s reviews. These concerns are a request for a search option and a report of a white-screen bug. Extracting these full reviews gives developers a better idea of what the main user concerns actually are.

Based on these observations, in our analysis, we employ several full-sentence extractive summarization algorithms for review summarization. These algorithms have been heavily used in short-text summarization tasks and have been shown to generate human-like summaries (Erkan and Radev 2004; Inouye and Kalita 2011; Nenkova and Vanderwende 2005). Furthermore, these algorithms are easy to understand and implement and are computationally less expensive than other techniques such as topic modeling (Chen et al. 2014) or cluster-based summarization (Villarroel et al. 2016). In detail, our summarization algorithms can be described as follows:

  • Hybrid Term Frequency (TF): Hybrid TF relies on the basic frequency of words to determine the importance of a specific sentence (user review) to the collection. Formally, the weight of a word wi is computed as its frequency in the entire collection of reviews (fwi) divided by the number of unique words in the collection (N). This modification (i.e. hybrid) over classical single-document TF is necessary to capture the concerns that are frequent over the entire collection (Inouye and Kalita 2011). The probability of a review (rj) of length n words to appear in the summary is calculated as the average of the weights of its individual words:

    $$ score(r_{j}) = \frac{1}{n} \sum\limits_{i = 1}^{n} fw_{i} / N. $$
    (4)
  • Hybrid TF.IDF: Hybrid TF.IDF is similar in concept to hybrid TF (Inouye and Kalita 2011). However, it accounts for the scarcity of words across all user reviews by using the inverse document frequency (IDF) of words. IDF penalizes words that are too frequent in the text. Formally, TF.IDF can be computed as:

    $$ TF.IDF = TF(w_{i}) \times \log \frac{|R|}{|r_{j} : w_{i} \in r_{j} \wedge r_{j} \in R|} $$
    (5)

    where TF(wi) is the term frequency of the word wi in the entire collection, |R| is the total number of reviews in the collection, and |rj : wi ∈ rj ∧ rj ∈ R| is the number of reviews in R that contain the word wi. The importance of a review can then be calculated as the average TF.IDF score of its individual words. To control for redundancy, or the chance of two very similar user reviews being included in the summary, before adding a top scoring review to the summary, the algorithm makes sure that the review does not have a textual similarity above a certain threshold with any review already present in the summary. The textual similarity between two reviews ri and rj can be calculated as the cosine of the angle between their vectors:

    $$ sim(r_{i}, r_{j}) = \frac{\vec{r_{i}} \cdot \vec{r_{j}}}{\Vert \vec{r_{i}} \Vert \; \Vert \vec{r_{j}} \Vert} $$
    (6)
  • SumBasic: Introduced by Nenkova and Vanderwende (2005), SumBasic uses the average term frequency (TF) of words in the text collection to determine their value. However, the weights of individual words are updated after they appear in the summary to minimize redundancy. This approach can be described as follows:

    1.

      The probability of a word wi with a frequency of fwi in a corpus of size N words is calculated as:

      $$ \rho(w_{i}) = fw_{i}/N $$
      (7)
    2.

      The weight of a review rj of length n words is calculated as the average probability of its words, given by:

      $$ score(r_{j}) = \frac{1}{n} \sum\limits_{i = 1}^{n} \rho(w_{i}) $$
      (8)
    3.

      The top scoring review is selected and added to the summary. To control for redundancy, or to minimize the chances of selecting reviews describing the same topic using the same high frequency words, the probability of each word in the selected review is reduced by:

      $$ \rho(w_{i})_{new} = \rho(w_{i}) \times \rho(w_{i}) $$
      (9)
    4.

      Repeat from 2 until the required length of the summary is met.

  • LexRank: LexRank is a graph-based algorithm that is used to determine the most important sentences in a given corpus. The algorithm works by generating an undirected graph of the sentences in the collection (Erkan and Radev 2004). Individual sentences (nodes) in the graph are connected using their cosine similarity. An n × n cosine-similarity matrix is built for the graph. A threshold can be applied to the similarity matrix to filter out insignificant links. Individual sentences in the graph can then be ranked using the PageRank algorithm (Page et al. 1999). Formally, using LexRank, the probability p(u) of a sentence u being included in the summary can be computed as follows:

    $$ p(u) = \frac{d}{N} + (1 - d) \sum\limits_{v \in adj(u)} \frac{sim(u,v)}{{\sum}_{z \in adj(v)} sim(z,v)} \, p(v) $$
    (10)

    where N is the total number of sentences in the document, d is the damping factor, typically selected as 0.85 (Brin and Page 1998), and sim(u,v) is the TF.IDF cosine similarity between u and v (6). Using this formula, when computing LexRank for a sentence, the LexRank scores of the linking sentences are multiplied by the weights of the links, thus accounting for information subsumption among sentences (Erkan and Radev 2004). A minimal sketch of this computation is given below.
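As an illustration, the graph construction and damped iteration described above can be sketched in Python as follows. For brevity, the similarity here uses raw term counts instead of the TF.IDF weights of (6), so it demonstrates the mechanics rather than reproducing the exact implementation evaluated in this paper.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(count * b.get(term, 0) for term, count in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def lexrank(reviews, d=0.85, iterations=20):
    """Return one LexRank score per review via a damped power iteration."""
    vectors = [Counter(r.lower().split()) for r in reviews]
    n = len(reviews)
    sim = [[cosine(vectors[i], vectors[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    # Normalize each column so the incoming weights of every node sum to one.
    col_sums = [sum(sim[i][j] for i in range(n)) or 1.0 for j in range(n)]
    p = [1.0 / n] * n
    for _ in range(iterations):
        p = [d / n + (1 - d) * sum(sim[i][j] / col_sums[j] * p[j] for j in range(n))
             for i in range(n)]
    return p
```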

Example

The following example demonstrates the operation of the different summarization algorithms using 4 reviews sampled from a picture sharing app. Two main user concerns are raised in these reviews. The first concern is a feature request (reviews R2 and R4), asking for a feature to display all pictures at once. The second concern is a bug report, describing a problem of a sudden crash whenever a picture is deleted (reviews R1 and R3).

[The four sample reviews: R1 and R3 report a crash that occurs whenever a picture is deleted; R2 and R4 request a feature to display all pictures at once]

After removing English stop-words and applying stemming, a total of 13 keywords are left to be considered by the summarization algorithms. Table 7 shows the frequency, Hybrid TF, and Hybrid TF.IDF weights of these words. Assuming a summary of length 2 is to be generated (only two reviews to be included in the summary), Hybrid TF first selects R3 as it has the highest average Hybrid TF score (Table 8). The algorithm then randomly picks either R2 or R4 as they both rank second in the list.

Table 7 Frequency, Hybrid TF, Hybrid TF.IDF, and Hybrid TF2 of the words in the sample review corpus
Table 8 Hybrid TF score, Hybrid TF.IDF score, SumBasic2 score (i.e., the score after the first iteration of SumBasic), and LexRank score of our sample reviews

Using Hybrid TF.IDF, R1 will be added to the summary first as it has the highest average Hybrid TF.IDF score (Table 8). Before the algorithm makes its second selection, it calculates the textual similarity (6) between R1 and the other three reviews. R3 is the most similar to R1 as they share three words (crash, picture, delete). Therefore, it is not included in the summary. Both R2 and R4 share the same textual similarity with R1. The algorithm picks one of them randomly.

Using SumBasic, R3 gets selected first to be included in the summary as it has the highest average Hybrid TF score. After selection, the Hybrid TF weights of the words (crash, delete, picture) get reduced by squaring them (9). Column 5 of Table 7 shows the Hybrid TF2 of these words. The weights of the individual reviews are now re-calculated. The results are shown in column 4 of Table 8. Now both R2 and R4 are the top scoring reviews after removing R3, so the algorithm randomly picks one of them to be included in the summary.
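The SumBasic procedure illustrated above can be sketched in a few lines of Python. This is a simplified rendering that operates on whitespace-tokenized text and omits the stemming and stop-word removal applied in our experiments, so its selections may differ slightly from the worked example.

```python
from collections import Counter

def sumbasic(reviews, k):
    """Select k reviews using average word probability with redundancy control."""
    tokenized = [r.lower().split() for r in reviews]
    total = sum(len(tokens) for tokens in tokenized)
    prob = {w: c / total
            for w, c in Counter(w for tokens in tokenized for w in tokens).items()}

    summary, remaining = [], list(range(len(reviews)))
    while remaining and len(summary) < k:
        # Score each remaining review by the average probability of its words.
        scores = {i: sum(prob[w] for w in tokenized[i]) / max(len(tokenized[i]), 1)
                  for i in remaining}
        best = max(scores, key=scores.get)
        summary.append(reviews[best])
        remaining.remove(best)
        # Square the probabilities of the selected words to limit redundancy.
        for w in set(tokenized[best]):
            prob[w] = prob[w] * prob[w]
    return summary
```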

Using LexRank, the algorithm first calculates the similarity between each pair of sentences using the TF.IDF cosine similarity (6). Figure 4 shows the resulting similarity graph. The values between the nodes denote the intra-sentence cosine similarities. Note that TF.IDF(picture) = 0 since picture appears in all reviews. The LexRank score of each sentence is then computed using (10). Initial p(v) values are set to 0.25 and a damping factor of 0.85 is used. Assuming the algorithm starts its random walk from R1, the LexRanks of the different reviews after the first iteration are shown in Table 8. The algorithm ends up selecting R1 and R2 as they have the highest LexRanks.

Fig. 4 The LexRank similarity graph of the reviews R1, R2, R3, and R4

4.2 Evaluation

The evaluation of summarization algorithms typically relies on the human judgment of the quality of generated summaries (Lin and Hovy 2003). For example, multiple judges are presented with different automated summaries of a specific text collection, and are then asked to rank these summaries based on their quality. Another evaluation approach relies on comparing the automatically-generated summaries with human-generated summaries (ground-truth) (Khabiri et al. 2011). In our analysis, we adopt the latter approach.

To conduct our evaluation, we recruited 8 programmers to participate in our experiment, including 4 graduate students in computer science and 4 industry professionals. Our subjects reported an average of 4.3 years of programming experience. The apps Alexa, WellsFargo, Equifax, LinkedIn, FB Messenger, and Dubsmash were selected to conduct our experiment. For each app, we collected the most recent 500 reviews. The reviews were collected during the first week of April 2017. These reviews were then classified using the BOF+SVM classifier. We then randomly sampled 100 informative reviews (classified as either bug reports or feature requests) from each app. These reviews were then randomized to create 4 different versions (the same 100 reviews but in different orders). This step is necessary to avoid any ranking bias (e.g., subjects might always favor reviews from the top of the list). It is important to point out that, given that our classifier is only around 80% accurate, a small portion of the sampled reviews did not contain any useful information (i.e., they were misclassified as informative).

Each of our subjects was then randomly assigned 3 different sets of reviews from three different apps to summarize, such that each randomized copy of each set of reviews from each app was summarized by at least one subject. Formally, assuming our set of subjects is {s1,s2,…,s8}, the list of apps is {a,b,c,d,e,f}, and for each app α, the list of 4 different randomized sets of reviews is {α1,α2,α3,α4}, the assignment of apps to subjects can be described as follows:

  • s1 = {f4,a4,c2}

  • s2 = {a1,d3,c4}

  • s3 = {d1,f3,b4}

  • s4 = {b3,c1,f2}

  • s5 = {d2,a2,e4}

  • s6 = {e3,b2,e1}

  • s7 = {f1,d4,c3}

  • s8 = {a3,e2,b1}

The main task of our subjects was to go through each set of reviews and identify 10 reviews that they believed captured the most important concerns raised in the set. No time constraint was enforced. However, most of our subjects responded within a two-week period.

The various summarization algorithms proposed earlier were then used to automatically summarize the set of 100 reviews sampled from each of the 6 apps included in our experiment. These reviews were initially pre-processed by stemming and by removing English stop-words. This step is necessary to generate more accurate summaries. For instance, common English words (e.g., the, could, they) or different forms of the same word (e.g., crash, crashes, crashing) can affect the frequency calculations of the summarization algorithms.

To assess the effectiveness of our summarization algorithms, for each app, we calculated the average term overlap between the human-generated, or reference, summaries and the various automatically generated summaries. This metric is based on ROUGE—a suite of metrics designed for the automatic evaluation of summarization algorithms (Lin 2004). Formally, the average recall of a summarization algorithm t can be calculated as:

$$ Recall_{t} = \frac{1}{\vert S \vert} \sum\limits_{i = 1}^{\vert S \vert} \frac{match(t, s_{i})}{count(s_{i})} $$
(11)

where S is the set of reference summaries, match(t,si) is the number of terms that appear in both the reference summary si and the summary generated by t, and count(si) is the number of unique terms in the reference summary si. An automated summary that contains a greater number of terms from the reference summary is considered of higher quality (Inouye and Kalita 2011; Lin 2004; Nenkova and Vanderwende 2005). In our analysis, recall is measured over different length summaries (10, 15, and 20 reviews included in the summary). The recall of the different summarization algorithms is shown in Fig. 5.
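Treating each summary as a set of unique terms, the metric in (11) reduces to a few lines of Python (a minimal sketch):

```python
def summary_recall(generated_terms, reference_summaries):
    """Average term overlap between one generated summary and the reference summaries.

    generated_terms: set of terms in the automatically generated summary.
    reference_summaries: list of sets, one set of unique terms per human summary.
    """
    total = sum(len(generated_terms & reference) / len(reference)
                for reference in reference_summaries)
    return total / len(reference_summaries)
```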

Fig. 5 The recall of the different summarization algorithms using the BOW approach, measured at different length summaries (10, 15, 20)

We further assessed the performance of the different summarization algorithms using the BOF representation of the reviews. In particular, the semantic frame representation for each review from each app was generated. These reviews were then summarized based on their frame frequency (the frequency of the frames in the reviews is used rather than the frequency of the words). After each algorithm has picked the top 10, 15, and 20 reviews to be included in the summary, we regenerated the textual representations of these reviews and compared them against the human-generated summaries. The recall of the different summarization algorithms using the BOF approach is shown in Fig. 6.

Fig. 6 The recall of the different summarization algorithms using the BOF approach, measured at different length summaries (10, 15, 20)

4.3 Results and Discussion

We conducted a brief interview with our subjects at the end of the experiment to understand their summarization behavior. Out of our 8 subjects, 3 indicated that they read the reviews from the top of the list downward, selecting a review every time an issue appeared for a second or a third time. The other 5 subjects indicated that they first went through the list of reviews once or twice, identified the main (most frequent) concerns, then randomly selected reviews that captured all these concerns. All subjects indicated that the frequency of an issue was the deciding factor to whether to include that issue in the summary or not. In what follows, we present and discuss our results in greater detail.

4.3.1 Summarization Results

Figures 5 and 6 show the performance of the different summarization algorithms using the BOW and BOF representations of the data, respectively. Furthermore, Table 9 shows the best performing summarization algorithm, in terms of recall, for each app using the BOF and BOW representations. Randomly generated summaries were used as an experimental baseline against which to compare the performance of our algorithms. The random baseline is commonly used to evaluate extractive summarization techniques (Inouye and Kalita 2011; Mackie et al. 2014). Basically, if a random extraction of text generates more cohesive summaries than a summarization algorithm, then the algorithm provides little practical value.

Table 9 The best performing summarization algorithm (recall) for each app under the BOW and BOF representations of the data

We conduct a two-way ANalysis Of Variance (ANOVA) to test whether the difference in the quality of summaries between the two representations of the data is statistically significant. The data are normally distributed according to the Kolmogorov-Smirnov test of normality (p = 0.200), thus ANOVA’s assumption of normality is met. Our first independent variable is the representation of the data (BOW and BOF) and our second independent variable is the summarization algorithm (Random, Hybrid TF, Hybrid TF.IDF, SumBasic, and LexRank). The dependent variable is the performance (as measured by (11)) of the summarization algorithms.

Assuming a significance level of α = 0.05, the results of our two-way ANOVA test show that there is a significant difference in the performance between the different algorithms (F = 21.58,p < 0.01). The results also show that the main effect of the data representation (BOF vs. BOW) is significant (F = 6.37,p < 0.05). A significant interaction effect is also detected between the representation of the data and the summarization algorithm (F = 4.92,p < 0.05). In particular, the summarization algorithms perform significantly better under the BOW representation.
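For readers who wish to reproduce this analysis, an equivalent two-way ANOVA can be computed with standard statistical tooling. The Python sketch below uses statsmodels and assumes a long-format table with one recall measurement per row; the column names are illustrative, and the original analysis may have been performed with a different statistics package.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# Assumed layout: one row per measurement with columns "recall",
# "representation" (BOW or BOF), and "algorithm" (Random, Hybrid TF,
# Hybrid TF.IDF, SumBasic, or LexRank).
def two_way_anova(df: pd.DataFrame) -> pd.DataFrame:
    model = ols("recall ~ C(representation) * C(algorithm)", data=df).fit()
    return sm.stats.anova_lm(model, typ=2)  # F statistics and p-values per effect
```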

We further ran Tukey’s Honest Significant Difference (HSD) test to determine which algorithms performed significantly better than others overall (Tukey 1949). Tukey’s HSD is a Post-Hoc analysis that can be run after ANOVA to determine which specific group means differ from each other. The test compares all possible pairs of means. The results in Table 10 show that, regardless of the data representation, all algorithms significantly outperformed the random baseline. The results also show that SumBasic managed to significantly outperform Hybrid TF (p < 0.01) and LexRank (p < 0.05). SumBasic also outperformed Hybrid TF.IDF. However, the difference in performance between these two algorithms failed to reach significance (p = 0.758).

Table 10 Comparing the performance of the different summarization algorithms using Tukey’s HSD Post-Hoc analysis

In general, under the BOW representation, SumBasic was the most successful in capturing the concerns raised in the human-generated summaries, achieving an average recall of 71%. Hybrid TF.IDF was also competitive, achieving an average recall of 60%. The best performance of Hybrid TF.IDF was achieved at 0.2 similarity threshold. Figure 7 shows that the performance of Hybrid TF.IDF deteriorates at larger thresholds (i.e., more similar reviews are allowed into the summary). Meanwhile, Hybrid TF failed to compete with the other algorithms, suggesting that redundancy control is important in order to achieve comprehensive summaries. LexRank, while it managed to slightly (but not significantly) outperform Hybrid TF (p = 0.836), could not match the performance of SumBasic and Hybrid TF.IDF, achieving an average recall of 41%.

Fig. 7 The performance of Hybrid TF.IDF at different similarity (redundancy) thresholds

4.3.2 Examples of Generated Summaries

To get a sense of the performance of the different summarization algorithms, we examine their performance on the list of reviews sampled from the Alexa app. Figure 8 shows the summaries (10 reviews each) generated by each of the summarization algorithms. Words that indicate common user concerns are highlighted. Longer reviews are truncated to save space. Manually examining the list of reviews for the Alexa app shows that the most frequent concerns are bug reports of app freezing and crashing and a white screen problem, and feature requests for a landscape mode, a search option, and an enhanced interface. Some other, but less frequent, concerns include problems with pairing with other devices and minor system setup problems.

Fig. 8 The main user concerns detected in the 10 reviews included in the summaries generated by the different summarization algorithms

Figure 8 shows that 7 out of the 10 reviews included in the summary generated by Hybrid TF contained valid user concerns. However, these concerns are redundant, describing only the bugs of app crashing, freezing, and the white screen problem. At a similarity threshold of 0.2, Hybrid TF.IDF was more successful than Hybrid TF, with 6 out of the 10 reviews included in the summary containing user concerns. However, these concerns covered a broader range of issues, including the bugs of system crashing, freezing, and the white screen problem, and the requests for a landscape mode and a search option.

Using SumBasic, 7 out of the top 10 reviews included in the summary contained valid user concerns. These concerns covered the bugs of crashing and freezing, the white screen problem, the request for the search option and the landscape mode, as well as other interface and usability issues. In the LexRank summary, 9 out of the 10 reviews contained technical user feedback. However, while it managed to capture some of the less popular issues, such as pairing with other devices and system setup problems, LexRank failed to capture major user concerns such as the requests for a search option and a landscape mode.

In general, our example shows that SumBasic and Hybrid TF.IDF were able to generate the most comprehensive summaries that captured the majority of the concerns our subjects identified. However, the best performance of Hybrid TF.IDF was only achieved after an exhaustive calibration of the redundancy control threshold. Hybrid TF was the least successful, only capturing the concerns that were most frequent in the reviews. In general, the words that are used to describe these concerns have the highest relative frequency. Therefore, reviews including these words tend to have higher scores than reviews containing other popular, but less frequent, concerns. LexRank was also less successful than SumBasic and Hybrid TF.IDF, even though it managed to add more technically useful reviews to the summary. This can be explained based on LexRank’s tendency to favor longer sentences (Otterbacher et al. 2009). Longer sentences tend to be more central in the similarity graph (Fig. 4) than shorter sentences. This can be attributed to the fact that, in addition to considering the value of a sentence to its neighbors, LexRank takes into account the importance of the neighbors to that sentence. Therefore, longer sentences that have more words are valued more than shorter sentences as they are strongly connected to more sentences in the similarity graph.

4.3.3 BOF vs. BOW Summarization

Our results also show that the overall performance of the summarization algorithms has significantly dropped under the BOF representation (F = 4.92,p < 0.05). In general, while using the BOF representation of the data had a positive impact on the classification accuracy, using this representation for extractive summarization seems to harm the performance. These conflicting results can be explained based on the level of abstraction required by different data mining tasks. More specifically, in review classification, we are interested in the general categories of the data, including whether a review describes a bug report or a feature request. In contrast, in summarization tasks, we are interested in a lower level of abstraction, down to the specific user issue. In such scenarios, using frame semantics might lead to information loss. The following three examples explain this problem:

  (a)

    The semantic representation of the review “I can’t download my videos” has only the frame Capability as words such as download and videos do not evoke any frames.

  (b)

    Even though the two reviews “Can we get a gray filter?” and “Can we get a red font pls?!” request two different features, both were regarded as one issue as they were annotated under the same frames as follows:

    [The identical semantic frame annotations generated for the two review sentences]
  (c)

    In the following three reviews:

    • “Landscape mode was taken away! Bad move”

    • “App doesn’t work at all, just a blank white screen.”

    • “I reset it and now it won’t do anything thing.”

In these reviews, a dominant frame such as Intentionally_Act, which is evoked by the generic words bad, doesn’t, and won’t, tends to be mistaken for a dominant issue, thus misleading the summarization algorithm.

In summary, our results show that a simple frequency-based summarization algorithm with redundancy control can generate summaries that are aligned with human judgment to a large extent. The BOF representation, while helpful for review classification, can lead to a significant decline in the performance of summarization algorithms. Therefore, for practical applications, a tool that relies on the BOF representation for classification and the BOW representation for summarization is expected to generate the most accurate results.
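To illustrate the information loss described in examples (a)-(c), the short Python sketch below contrasts the BOW and BOF vectors of the two reviews from example (b). The frame labels are hand-written for illustration only (the actual annotations are produced by a frame semantic parser in our pipeline); the point is that the two distinct feature requests collapse into identical frame vectors, while their word vectors remain distinguishable.

    from collections import Counter

    # Hand-written, illustrative annotations in the spirit of example (b).
    reviews = {
        "Can we get a gray filter?":   {"words": ["can", "we", "get", "a", "gray", "filter"],
                                        "frames": ["Capability", "Getting", "Color"]},
        "Can we get a red font pls?!": {"words": ["can", "we", "get", "a", "red", "font", "pls"],
                                        "frames": ["Capability", "Getting", "Color"]},
    }

    bow = {r: Counter(a["words"]) for r, a in reviews.items()}
    bof = {r: Counter(a["frames"]) for r, a in reviews.items()}

    # The BOW vectors still distinguish the two feature requests ...
    print(bow["Can we get a gray filter?"] == bow["Can we get a red font pls?!"])  # False
    # ... while the BOF vectors collapse them into a single issue.
    print(bof["Can we get a gray filter?"] == bof["Can we get a red font pls?!"])  # True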

5 Tool Support: MARC 2.0

We implemented the classification and summarization algorithms investigated in our analysis in MARC 2.0—an extension of our Mobile Application Review Classifier (MARC 1.0) prototype presented in Jha and Mahmoud (2017a). This implementationFootnote 9 is provided to demonstrate the computational feasibility of our algorithms, facilitate the replication of our analysis, and ultimately, be used by app developers in their daily review mining tasks. The following is a description of the main features of MARC 2.0.

5.1 Data Collection

MARC 2.0 supports a data collection feature that enables users to download the most recent reviews from the Apple App Store. Technically, MARC 2.0 uses the iTunes IDs of apps to make web requests to the App Store’s RSS feed. The generated JSON pages are then parsed by a special-purpose parser to extract user reviews. App ID numbers can be obtained directly from the URL of the app on iTunes. For example, Gmail’s ID number (422689480) can be obtained directly from its iTunes page as follows:

(Figure l: Gmail’s iTunes page URL, which contains the app ID 422689480)

Once the app ID number is provided, MARC 2.0 makes the following Web request:

(Figure m: the Web request issued by MARC 2.0 to the App Store’s RSS feed)

Extracted reviews are then displayed to the user on the home page of MARC 2.0.
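For illustration, the following Python sketch performs the same kind of data collection: it requests the most recent reviews of an app from the App Store’s RSS feed and extracts the review fields from the returned JSON. The feed URL shown is one commonly documented form of the customer-reviews feed and the feed/entry JSON layout is assumed here; the exact request issued by MARC 2.0 is the one shown in the figure above.

    import json
    import urllib.request

    def fetch_recent_reviews(app_id, country="us"):
        """Download the most recent reviews of an app from the App Store RSS feed.

        The URL below is one commonly documented form of the customer-reviews
        feed; the feed/entry JSON layout is assumed for illustration."""
        url = (f"https://itunes.apple.com/{country}/rss/customerreviews/"
               f"id={app_id}/sortBy=mostRecent/json")
        with urllib.request.urlopen(url) as response:
            feed = json.load(response)

        reviews = []
        for entry in feed.get("feed", {}).get("entry", []):
            # Some feeds start with an entry describing the app itself; it has no rating.
            if "im:rating" not in entry:
                continue
            reviews.append({
                "title":  entry["title"]["label"],
                "body":   entry["content"]["label"],
                "rating": entry["im:rating"]["label"],
            })
        return reviews

    # Example: Gmail's iTunes ID mentioned above.
    # print(len(fetch_recent_reviews(422689480)))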

5.2 Classification

The classification engine of MARC 2.0 currently supports Naive Bayes (NB) and Support Vector Machines (SVM). These two classifiers are implemented through Weka’s API; the input reviews are first converted into a Weka-compatible file format (.arff), and the StringToWordVector filter is used to generate the word × document matrix for the reviews to be classified. MARC 2.0 uses a default training dataset of manually classified reviews, compiled from the dataset used in our experimental analysis, to train and test the underlying classification engine. Users can further provide their own training datasets. To ensure flexibility, MARC 2.0 enables users to choose from multiple text preprocessing settings, including stemming and stop-word removal. Stemming is implemented using Weka’s IteratedLovinsStemmer (Lovins 1968), and stop-words are provided in a separate configuration file that users can edit. Furthermore, MARC 2.0 enables users to select a data representation (BOW vs. BOF) for classification. The BOF representation is supported through a special-purpose parser that reads and parses the XML file generated by the probabilistic frame semantic parser SEMAFOR (Das et al. 2010). Figure 2 shows the average time MARC 2.0 requires to generate the BOF representation of the input reviews.
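The classification step can be sketched as follows. The example uses scikit-learn in Python rather than the Weka API that MARC 2.0 is actually built on, so it should be read as an analogous pipeline: CountVectorizer plays the role of the StringToWordVector filter, and MultinomialNB and LinearSVC stand in for Weka’s NB and SVM implementations. The two training reviews and their labels are hypothetical.

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Hypothetical labeled reviews; MARC 2.0 ships with a default training set
    # compiled from our experimental dataset.
    train_texts = ["App crashes every time I open it", "Please add a landscape mode"]
    train_labels = ["bug report", "feature request"]

    def build_classifier(kind="nb"):
        # The vectorizer turns raw review text into a word-by-document count
        # matrix, analogous to Weka's StringToWordVector filter.
        vectorizer = CountVectorizer(lowercase=True, stop_words="english")
        model = MultinomialNB() if kind == "nb" else LinearSVC()
        return make_pipeline(vectorizer, model)

    clf = build_classifier("nb").fit(train_texts, train_labels)
    print(clf.predict(["The app freezes on a white screen"]))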

5.3 Summarization

MARC 2.0 provides users with an option to summarize classified user reviews. The summarization engine of MARC 2.0 supports the summarization algorithms evaluated in our analysis (Hybrid TF, Hybrid TF.IDF, SumBasic, and LexRank). Users can select the size of the summary (the number of reviews to be included) as well as the redundancy (similarity) threshold for Hybrid TF.IDF. MARC 2.0 also enables users to generate word cloud summaries of reviews. Users can switch back and forth between the two views.
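As an illustration of the graph-based alternative supported by the engine, the sketch below implements a continuous LexRank-style ranking: reviews are connected in a cosine-similarity graph and ranked by a PageRank-style power iteration, so that reviews central to the graph rise to the top. The threshold, damping factor, and iteration count are illustrative defaults, not the values used by MARC 2.0.

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics.pairwise import cosine_similarity

    def lexrank(reviews, k=10, threshold=0.1, damping=0.85, iters=100):
        """Rank reviews by centrality in a cosine-similarity graph."""
        tfidf = TfidfVectorizer().fit_transform(reviews)
        sim = cosine_similarity(tfidf)
        # Keep only edges above the similarity threshold and drop self-loops.
        adj = np.where(sim >= threshold, sim, 0.0)
        np.fill_diagonal(adj, 0.0)
        row_sums = adj.sum(axis=1, keepdims=True)
        row_sums[row_sums == 0] = 1.0
        transition = adj / row_sums
        # PageRank-style power iteration over the damped transition matrix.
        n = len(reviews)
        scores = np.full(n, 1.0 / n)
        for _ in range(iters):
            scores = (1 - damping) / n + damping * transition.T.dot(scores)
        top = np.argsort(-scores)[:k]
        return [reviews[i] for i in top]

Because a longer review tends to exceed the similarity threshold with more of its neighbors, it accumulates more incoming weight, which is the length bias discussed earlier.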

5.4 Stop-Word Editor

MARC 2.0 provides users with a feature to edit their list of stop-words (adding and removing words). Our analysis has shown that the quality of the generated summaries can be severely impacted by irrelevant words that provide no useful information to app developers, yet do not appear in generic lists of English stop-words. For instance, app names tend to appear frequently in their own reviews, thus skewing the frequency calculations of the summarization algorithms. Therefore, it is necessary to provide users with the ability to filter these words out.
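A minimal sketch of this filtering step is shown below; the stop-word file name and the example app name are illustrative, not the actual configuration shipped with MARC 2.0.

    def load_stop_words(path="stopwords.txt"):
        """Read a user-editable stop-word list, one word per line."""
        with open(path, encoding="utf-8") as f:
            return {line.strip().lower() for line in f if line.strip()}

    def filter_tokens(tokens, stop_words, app_names=("gmail",)):
        """Drop generic stop-words plus app names, which would otherwise
        inflate the frequency counts used by the summarization algorithms."""
        blocked = stop_words | {name.lower() for name in app_names}
        return [t for t in tokens if t.lower() not in blocked]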

6 Related Work

The research on app store review analysis has noticeably advanced in the past few years. Numerous studies have been conducted on review classification, summarization, and prioritization. A comprehensive survey of these studies is provided in Martin et al. (2017). In this section, we selectively review and discuss important related work in this domain. Our review also includes important work from the domain of mining social media platforms (Twitter) for software user feedback.

6.1 Mining App Store Reviews for User Feedback

Iacob and Harrison (2013) introduced MARA, a tool for mining feature requests from app store reviews. MARA identifies sentences expressing feature requests based on a set of predefined linguistic rules. These rules were identified by analyzing keywords and linguistic patterns associated with feature requests. MARA was evaluated using a sample of 480 reviews extracted from Google Play. The results showed that 23.3% of reviews represented feature requests.

Carreńo and Winbladh (2013) applied topic modeling and sentiment analysis classification to identify user comments relevant to requirement changes. Specifically, the authors processed user comments to extract the main topics mentioned as well as some sentences representative of those topics. Evaluating the proposed approach over three datasets of manually classified user reviews showed promising performance levels in terms of accuracy and effort-saving.

Guzman and Maalej (2014) proposed an automated approach to help developers filter, aggregate, and analyze app reviews. The proposed approach uses a collocation finding algorithm to extract any fine-grained requirements mentioned in the review. These requirements are grouped into more meaningful high-level features using topic modeling. The authors used over 32,210 reviews extracted from 7 iOS and Android apps to conduct their analysis. The results showed that the proposed approach managed to successfully capture and group the most common feature requests in the reviews.

Chen et al. (2014) presented AR-Miner, a computational framework that helps developers to identify the most informative user app reviews. Uninformative reviews were initially filtered out using Expectation Maximization for Naive Bayes—a semi-supervised text classification algorithm. The remaining reviews were then analyzed and categorized into different groups using topic modeling (Blei et al. 2003). These groups were ranked by a review ranking scheme based on their potential information value. The proposed approach was evaluated on a manually classified dataset of app reviews collected from 4 popular Android apps. The results showed high accuracy levels in terms of precision, recall, and the quality of the ranking.

Panichella et al. (2015) proposed a supervised approach for classifying mobile app reviews into several categories of technical feedback (e.g., bug reports and feature requests). The authors extracted a set of linguistic features from each review, including the most important words, the main sentiment of the review, and any linguistic patterns that represented potential maintenance requests. Different types of classifiers were then trained using various combinations of these features. The results showed that Decision Trees (Quinlan 1986), trained over recurrent linguistic patterns and sentiment scores, achieved the best performance in terms of precision and recall.

Maalej and Nabil (2015) introduced several probabilistic techniques for classifying app reviews into bug reports, feature requests, user experiences, and ratings. The authors experimented with several binary and multi-class classifiers, including Naive Bayes, Decision Trees, and Maximum Entropy. A dataset of 4400 manually labeled reviews from Google Play and the Apple App Store was used to evaluate the performance of these different classifiers. The results showed that binary classifiers (Naive Bayes) were more accurate for predicting the review type than multi-class classifiers. The results also revealed that review features, such as star-rating, tense, sentiment scores, and length, as well as text analysis techniques, such as stemming and lemmatization, enhanced the accuracy of the classification.

Khalid et al. (2015) conducted an analytical study of user reviews with the main objective of helping developers to better anticipate and prioritize possible user complaints. The authors manually examined and classified thousands of app reviews from 20 iOS apps, focusing on one- and two-star reviews. The analysis uncovered 12 types of common user complaints, with functional errors being the most frequent.

McIlroy et al. (2016) analyzed the multi-labeled nature of user reviews in popular app stores. A qualitative analysis of the data showed that a substantial proportion (30%) of user reviews raised more than one issue type (feature requests, functional complaints, and privacy issues). The authors experimented with several classification and multi-labeling techniques to automatically assign multiple labels to reviews. The results showed that a combination of Pruned Sets with threshold extension (PSt) (Read et al. 2008) and SVM achieved the best performance.

Villarroel et al. (2016) introduced CLAP (Crowd Listener for releAse Planning), which categorizes and prioritizes user reviews to aid in release planning. Technically, the authors used DBSCAN (Ester et al. 1996), a density-based clustering algorithm, to discover clusters of related reviews. The Random Forests algorithm was then used to label each cluster as high or low priority based on factors such as the size of the cluster, its average review rating, and the hardware devices mentioned in its reviews. CLAP was evaluated in industrial settings and using expert judgment. The results showed that CLAP can accurately categorize and cluster reviews and make meaningful release planning recommendations.

Ciurumelea et al. (2017) proposed a taxonomy to analyze the reviews and code of mobile apps for better release planning. The authors defined mobile-specific categories of user reviews that can be highly relevant for developers during software maintenance (e.g., compatibility, usage, resources, pricing, and complaints). A prototype that uses Machine Learning (ML) and Information Retrieval (IR) techniques was then introduced to classify reviews and recommend source code files that are likely to be modified to handle issues raised in the reviews. The proposed approach was evaluated using 39 open source apps from Google Play. The results showed that the proposed approach can organize reviews according to the predefined taxonomy with a decent level of accuracy.

Groen et al. (2017) studied mining user quality concerns (non-functional requirements) from app reviews. By tagging online reviews, the authors found that users mainly expressed usability and reliability concerns, focusing on aspects such as operability, adaptability, fault tolerance, and interoperability. The authors further proposed a set of linguistic patterns to automatically capture usability concerns in user reviews. Evaluating these patterns using a large dataset of reviews showed that they can be used to identify statements about user quality concerns with high precision. However, very low recall levels were reported.

Johann et al. (2017) identified 18 part-of-speech patterns (e.g., verb-noun-noun) and 5 sentence patterns (e.g., enumerations and conjunctions) that are frequently used in review text to refer to app features. The main advantage of using such patterns over classification algorithms, such as SVM and NB, is that no large training and configuration data are required. The proposed approach was evaluated using reviews extracted from 10 different apps. The results showed that linguistic patterns outperformed other models that relied on more computationally expensive techniques, such as sentiment analysis and topic modeling.

In summary, our review shows that classical text classification techniques (SVM and NB) are frequently used to categorize app user reviews. In some cases, linguistic patterns are used as a direct and unsupervised approach for capturing user technical requests (Groen et al. 2017; Ha and Wagner 2013; Johann et al. 2017; Panichella et al. 2015). Such methods rely on mining recurrent syntactic patterns from reviews, such as “[someone] should try to [verb]”. Our work in this paper can be perceived as a combination of these two approaches. In particular, reviews are first translated into their linguistic frame representation, and these representations are then classified using algorithms such as SVM and NB. The advantage of using semantic frames over linguistic patterns stems from the fact that preparing a complete list of patterns can be a laborious and time-consuming process; researchers have to manually mine hundreds of reviews to capture and isolate such patterns (Iacob and Harrison 2013). Consequently, the pattern-based approach tends to achieve very low recall in comparison to text classification algorithms such as SVM and NB (Groen et al. 2017; Panichella et al. 2015). The semantic frames approach, in contrast, relies on a database of more than 200,000 total annotation sets,Footnote 10 thus accounting for more of the linguistic structures that might emerge in the reviews.
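To clarify the trade-off, the sketch below shows what a pattern-based detector looks like in practice. The regular expressions are illustrative stand-ins, not the actual rule sets of Iacob and Harrison (2013) or Panichella et al. (2015); the point is that any request phrased outside the hand-crafted rules is silently missed, which is the recall limitation noted above.

    import re

    # Illustrative pattern rules; not the published rule sets.
    FEATURE_REQUEST_PATTERNS = [
        re.compile(r"\b(please|pls)\s+(add|include|support)\b", re.IGNORECASE),
        re.compile(r"\b(would|will)\s+be\s+(nice|great|good)\s+(if|to)\b", re.IGNORECASE),
        re.compile(r"\bshould\s+try\s+to\b", re.IGNORECASE),
    ]

    def matches_feature_request(review):
        return any(p.search(review) for p in FEATURE_REQUEST_PATTERNS)

    print(matches_feature_request("Please add a landscape mode"))        # True
    print(matches_feature_request("A landscape mode is sorely missing")) # False (missed)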

In terms of summarization, our literature review shows that the majority of existing work relies on topic modeling techniques, such as Latent Dirichlet Allocation (LDA) (Blei et al. 2003), to create clusters of topically-related reviews (Carreńo and Winbladh 2013; Chen et al. 2014; Guzman and Maalej 2014). However, most state-of-the-art topic modeling techniques require an exhaustive calibration of several parameters in order to generate meaningful results (Blei et al. 2003). Furthermore, the generated topics are often not trivial to interpret and rationalize, and going through a large number of topics (100-200) can be an exhausting and error-prone process (Carreńo and Winbladh 2013). Other clustering techniques (e.g., DBSCAN) also require setting the values of several parameters a priori in order to produce meaningful clusters (Ciurumelea et al. 2017). This level of operational complexity limits the practicality of any tools built on top of these techniques (Lo et al. 2015). In our analysis, we avoid this complexity by using frequency-based summarization techniques. These techniques require no calibration from the user and have a very low computational cost in comparison to topic modeling and clustering techniques, and thus can be easily integrated into practical working prototypes.

6.2 Mining Twitter Feeds for Software User Feedback

Twitter enables large populations of end-users of software to publicly share their experiences and concerns about software systems in the form of micro-blogs. Similar to user feedback available on app stores, Twitter data can be collected and classified to help software developers infer users’ needs, detect bugs in their code, and plan for future releases of their systems. For instance, in our previous work (Williams and Mahmoud 2017), we examined Twitter as a source of useful user technical feedback that software developers can benefit from. Our analysis was conducted using 4,000 tweets collected from the Twitter feeds of 10 software systems. Our results revealed that around 50% of collected tweets contained actionable maintenance requests. The results also showed that text classifiers, such as SVM and NB, can be very effective in capturing and categorizing technically informative tweets. We further evaluated the performance of Hybrid TF, Hybrid TF.IDF, and SumBasic, in summarizing software-related tweets. Our results showed that, similar to our findings in this paper, SumBasic was able to achieve the highest agreement with our human-generated summaries.

In Guzman et al. (2016), the authors manually analyzed and classified a sample of 1,000 tweets to determine the usage characteristics and content of software-relevant tweets. The analysis showed that software tweets contained useful information for different groups of technical and non-technical stakeholders. The authors also used automated classification to identify tweets relevant to technical stakeholders. The results showed that SVM slightly outperformed Decision Trees in isolating useful tweets. In a more recent work, Guzman et al. (2017) presented ALERTme, an approach to automatically classify, group, and rank tweets about software applications. The authors used Multinomial NB to classify tweets with software improvement requests. Topic modeling was then used to group semantically related tweets. Topically-related groups of tweets were then ranked based on their perceived importance using a weighted function of various attributes. ALERTme was evaluated using 68,108 tweets collected from the Twitter feeds of multiple software applications. The results showed decent accuracy levels. The results also showed that Twitter could be an important source for information relevant to software evolution, including end-user requirements.

Nayebi et al. (2017) analyzed the Twitter feeds and software reviews of 70 mobile apps. The authors used NB to automatically classify their data into bug reports and feature requests. Topic modeling (LDA) was then used to generate groups of related concerns. The results showed that Twitter can provide complementary information to support mobile app development, with around 22% more feature requests and 13% more bug reports found on Twitter only.

In general, the converging evidence indicates that Twitter can be leveraged as another complementary channel of communication which app developers can exploit for technical feedback. Technically, the analysis suggests that tweets share several semantic and lexical attributes with user reviews. Therefore, classification (e.g., NB and SVM) and summarization (e.g., LDA and SumBasic) techniques that have been proven effective in app store review analysis can be adapted to Twitter data if certain limitations are addressed. For instance, Twitter messages are inherently shorter than user reviews. This might negatively impact the effectiveness of data intensive methods such as LDA. However, this issue can be addressed by using techniques such as assisted LDA which groups multiple tweets in order to produce longer text (Mehrotra et al. 2013). Furthermore, Twitter feeds often contain a higher percentage of irrelevant information and spam (Wang 2010). This might impact the accuracy of data classification and summarization techniques. To address this issue, spam filtering (McCord and Chuah 2011) and smart data collection techniques could be used to filter out such irrelevant data (e.g., only considering tweets that are directly addressed to the Twitter accounts of software systems) (Williams and Mahmoud 2017).

7 Threats to Validity

The analysis presented in this paper has several limitations. In what follows, we describe the potential internal, external, and construct validity threats of our study along with our mitigation strategies.

7.1 Internal Validity

Internal validity refers to the confounding factors that might affect the causal relations established in the experiment (Dean and Voss 1999). A potential threat to the internal validity of our study is the fact that human judgment was used to prepare our ground-truth dataset. Furthermore, human experts were used to generate our reference summaries. This might introduce experimental bias, as humans tend to be subjective in their judgment. However, it is not uncommon in text classification tasks to rely on humans to manually classify the data. Similarly, evaluating machine-generated summaries against human-generated summaries is a standard evaluation procedure. Ultimately, humans are the intended users of the summaries, so they are the best judges of their quality and cohesion. While the subjectivity and bias threats of using humans are inevitable, they can be partially mitigated by using multiple judges to classify and summarize the data. In our analysis, the data were classified by 3 judges and summarized by 8 experts with varying levels of expertise.

A further threat might stem from the fact that 4 of the human experts who summarized our data were graduate students. However, we believe that the impact of this threat was minimal, as all of our graduate student subjects reported some industrial experience (2 years on average). In fact, existing evidence in experiment-based software engineering research suggests that the differences between industrial professionals and graduate students are negligible (Runeson 2003).

7.2 External Validity

Threats to external validity impact the generalizability of the results (Dean and Voss 1999). In particular, the results of our experiment might not generalize beyond our specific experimental settings. A potential threat to our external validity stems from the dataset used in our classification and summarization experiments. In particular, our dataset is limited in size and was generated from a limited number of apps. To mitigate this threat, we compiled our dataset from several sources, including two external datasets that have been used before in the literature and a local dataset that was collected by us. We also made sure that our reviews were selected from a diverse set of apps, extending over a broad range of application domains.

7.3 Construct Validity

Construct validity is the degree to which the various performance measures accurately capture the concepts they purport to measure (Dean and Voss 1999). In our experiment, there were minimal threats to construct validity as the standard performance measures (Recall, Precision, and F1), which are extensively used in related research, were used to assess the performance of our classification techniques. We further used a metric that is based on ROUGE—a benchmark suite of metrics designed for the automatic evaluation of summaries (Lin 2004)—to evaluate our different extractive summarization algorithms. We believe that these measures sufficiently quantified the different aspects of performance we were interested in.
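For reference, the following sketch computes a ROUGE-1-style unigram recall between a machine-generated summary and a reference summary. It is a plain illustration of the general idea behind ROUGE (Lin 2004) and not the exact variant of the metric used in our evaluation; the example texts are hypothetical.

    from collections import Counter

    def rouge1_recall(candidate, reference):
        """Fraction of reference unigrams that also appear in the candidate,
        with counts clipped by the candidate's counts (ROUGE-1 recall style)."""
        cand = Counter(candidate.lower().split())
        ref = Counter(reference.lower().split())
        overlap = sum(min(count, cand[w]) for w, count in ref.items())
        return overlap / sum(ref.values()) if ref else 0.0

    print(rouge1_recall("app crashes on launch",
                        "the app crashes every time on launch"))  # 4/7, about 0.57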

8 Conclusions and Future Work

User reviews on mobile application stores represent a rich source of technical information for app developers. Such information can be mined to enable an adaptive and responsive release planning process. Following this line of research, we investigated the performance of a novel semantically aware approach for classifying and summarizing user reviews on app stores. The proposed approach relies on semantic role labeling. In particular, individual user review sentences are extracted and annotated to identify the semantic roles played by the words that appear in each sentence. Such roles, known as semantic frames, capture the underlying meaning of the review. The main assumption is that relying on the meaning of the text enhances the predictive capabilities of data mining algorithms.

To conduct our analysis, an experimental dataset of user reviews was compiled from multiple sources (Chen et al. 2014; Maalej and Nabil 2015; Jha and Mahmoud 2017b). Individual reviews were semantically annotated using FrameNet. Annotated sentences, represented as Bags-of-Frames (BOF), were then classified using Naive Bayes (NB) and Support Vector Machines (SVM). The results showed that the BOF approach achieved competitive results in comparison to the Bag-of-Words (BOW) approach. However, classifiers trained using the BOF representation of the text were able to generalize better over a test set of never-seen-before reviews, suggesting that the BOW classification models suffered from overfitting. The main advantage of the BOF approach over the BOW approach stems from the drastic reduction in the number of features required for classification. A smaller number of features (frames vs. words) produces lower-dimensional models, which in turn can make more accurate predictions.

To facilitate a more effective data exploration process, technically informative user reviews were then summarized using multiple summarization algorithms (Hybrid TF, Hybrid TF.IDF, SumBasic, and LexRank). These algorithms are known for their simplicity and effectiveness in the context of social media data. A human experiment, involving 8 programmers, was conducted to assess the performance of these algorithms. Specifically, review summaries generated by the different summarization algorithms were compared to human-generated summaries. The results showed that SumBasic, a frequency-based summarization algorithm with redundancy control, was able to generate summaries that were aligned with the human judgment to a large extent. The results also showed that using the semantic representation (BOF) of the reviews can lead to information loss, thus generating less representative summaries. Finally, the line of research in this paper has opened several research directions to be pursued in our future work, including:

  • Analysis: In our future work, other data classification and summarization algorithms will be investigated. Our objective is to identify combinations of these algorithms that can achieve high accuracy levels while maintaining moderate computational and space complexities.

  • User studies: Multiple user studies will be conducted in order to evaluate the usage aspects of our proposed methods and tools as well as compare their performance against other existing app review classification and summarization tools (e.g. AR-Miner (Chen et al. 2014)).

  • App Market Analysis: A major part of our future work will be focused on utilizing our findings to support innovation and survival in the app market. This goal will be achieved by integrating other sources of user feedback (e.g., Twitter (Guzman et al. 2016; Williams and Mahmoud 2017)) with app store reviews to enable a comprehensive understanding of users’ expectations and needs and predict the impact of feature updates on app user acquisition and retention rates.