1 Introduction

In recent years, software development and release activities have shifted from a traditional process, in which software projects are released following a clearly defined road-map, towards a modern process in which continuous releases become available on a weekly or daily basis. Nowadays, such an agile strategy is massively adopted by mobile applications (apps). Indeed, with more than two billion users relying on smartphones and tablets (Catolino et al. 2019), mobile app development undergoes continuous changes to add new features, fix reported issues, or adapt to new technological and environmental changes. Hence, many mobile app teams release daily updates to quickly deliver up-to-date applications to end users (Openja et al. 2020).

In this context, software change management represents a fast-paced task of extreme complexity (Klepper et al. 2015), while mobile release engineering is a non-trivial and risky task that requires comprehensive information and knowledge (Nayebi et al. 2016). In fact, the tension between release speed and quality is a major concern for mobile app developers, as bad changes adversely affect the user experience and may drive users away over time in a very competitive mobile app market (Palomba et al. 2015; Villarroel et al. 2016). Indeed, one of the unique and important features that mobile app platforms, such as the Google Play Store, provide is user reviews and ratings. User reviews are a powerful asset reflecting users’ (dis)satisfaction and can provide a complementary view of the app’s success and quality, as a large proportion of reviews contain bug reports (Maalej and Nabil 2015; Panichella et al. 2015; 2016). For instance, unexpected or poor app changes may cause even loyal users to explore alternative apps, as pointed out by Martens and Maalej (2019). Recently, Hassan et al. (2018) showed that various app changes, such as feature removal and user interface (UI) issues, tend to increase the number of negative user reviews, while bad updates involving crashes and functional issues tend to be fixed in subsequent updates. Therefore, the analysis of user reviews about a specific update is of pivotal importance, as pointed out by Hassan et al. (2018). Hence, providing developers with relevant tools to track and prevent bad updates before pushing them to the marketplace is crucial to maintain and improve the rating of their apps.

To address this issue, we introduce a novel approach, namely AppTracker, to automate the tracking of mobile app bad updates (i.e., updates with a higher percentage of negative user reviews relative to the prior updates of the app (Hassan et al. 2018)). The problem is formulated as a three-class classification problem to classify releases into “good”, “bad” or “neutral”. In particular, we adopt the One-Versus-All (OVA) method (Rocha and Goldenstein 2013), which consists of decomposing our multi-class classification problem into multiple binary problems. Then, we evolve various binary classifiers to generate classification rules using Multi-Objective Genetic Programming (MOGP) as a base learner. Under the OVA method, for each class i, we train a base MOGP learner using all instances of this class as positive data points, while the instances of the remaining classes are considered negative data points. Our MOGP formulation is based on an adaptation of the non-dominated sorting genetic algorithm (NSGA-II) (Deb et al. 2002). MOGP techniques have been widely adopted in search-based software engineering (SBSE) (Harman and Jones 2001; Ouni 2020) to solve various classification-related software engineering problems (Saidani et al. 2020; Kessentini et al. 2014; Almarimi et al. 2020; Kessentini and Ouni 2017; Ouni et al. 2015; Harman et al. 2012), due to their efficiency in exploring large search spaces and finding near-optimal solutions. More specifically, AppTracker aims at learning patterns from examples of bad app releases that have been experienced by end users. These patterns are expressed as tree-based solutions representing logical combinations of metrics and their corresponding threshold values. These solutions are refined through a multi-objective evolutionary search process to converge towards optimal detection rules that correctly cover as many as possible of the (1) bad, (2) good and (3) neutral releases in the base of real-world app release examples.

To evaluate AppTracker, we performed an empirical study on a large benchmark of 50,700 releases extracted from 1,717 popular apps in the Google Play Store.Footnote 1 Based on two different validation scenarios, within-project and cross-project settings, our results confirm that AppTracker statistically outperforms the baseline techniques. Moreover, we leverage the generated rules by analyzing the Pareto fronts (i.e., the non-dominated solutions) achieved by the MOGP algorithm. In particular, we measure feature importance using the Permutation Feature Importance (PFI) technique (Breiman 2001; Fisher et al. 2019), and then rank the features using the Scott-Knott (SK) algorithm (Tantithamthavorn et al. 2017; 2018a) in order to prioritize the refactoring efforts during the app’s maintenance. The results of this analysis reveal that the previous updates’ ratings and the APK size are the most important features in both the within-project and cross-project scenarios.

1.1 Contributions

The paper makes the following main contributions:

  1. A novel approach, AppTracker, that formulates the detection of bad mobile app releases as a multi-class classification problem based on MOGP. We adopt MOGP as a base learner to support multi-class classification by decomposing it into multiple binary problems using the one-versus-all method. To the best of our knowledge, this is the first search-based technique for the detection of bad releases in mobile applications.

  2. An empirical evaluation on a benchmark of 50,700 releases from 1,717 Android apps, showing that AppTracker outperforms various baseline machine learning techniques by achieving median F1 scores of 46% and 47% in within-project and cross-project validations, respectively, across the three classes.

  3. A qualitative analysis, based on the Pareto fronts of optimal rules, to discover which features are the most prominent. The results reveal that the previous updates’ ratings and the APK size are the most important features in both the within-project and cross-project scenarios.

  4. A longitudinal labeled dataset from the Google Play Store covering 1,717 free-to-download Android apps and over 50,700 release updates over a period of more than three years (Dataset for bad releases detection 2021).

1.2 Replication Package

We provide our replication package containing all the materials to reproduce and extend our study (Dataset for bad releases detection 2021).

1.3 Paper Organization

In Section 2, we motivate the problem of tracking bad mobile releases with a real-world example. Then, we explain our approach in Section 3. Section 4 describes the experimental setup of our empirical study while Section 5 presents the results of this study. In Section 6, we elaborate on the implications of our results. Section 7 discusses the threats to validity. In Section 8, we survey the related work. Section 9 finally concludes and discusses future research directions.

2 Motivating Example

To show the importance of early identification of bad release updates in mobile apps, we describe in this section a motivating example from a real-world Android app. Let us consider Dubsmash,Footnote 2 a popular video sharing Android app (in the Video Players category). Dubsmash used to maintain a stable rating history of 4.2/5 and most of its updates were either “neutral” or “good”, suggesting that the app had a negligible amount of negative user ratings for its updates. However, looking at its release history, we observe that the negativity ratio (i.e., the ratio of the percentage of negative reviews of an update Ui to the median percentage of negative reviews of its previous updates) increased sharply immediately after the release of U43 (16 October 2018), as shown in Fig. 1. For instance, users were unhappy and complaining about the recent updates, leaving comments such as “This apps was very fun but got progressively worse but got used to it, now it’s practically unusable ...” (cf. Fig. 2). While the app developers started to deploy more frequent updates with shorter delays (less than two weeks on average) to address the users’ concerns, users continued expressing their complaints. Within a few months, the negative ratings increased from 10% to 25% on 14 March 2019.

Fig. 1: A snapshot of updates fluctuations in the Dubsmash app between 2016-05-17 and 2019-03-14

Fig. 2: Examples of users’ reviews on the Dubsmash app from the Google Play Store

A closer examination of the Dubsmash app change history shows that during this period, many features were deleted from the app. As a result, the app installation size (i.e., the APK file) decreased by 71% (from 30 MB to 8.7 MB), the number of activities dropped from 53 to 31 (a decrease of about 42%), and the number of intents decreased by 78%. In addition, the minimum SDK version (i.e., the Android version required to run the app) was upgraded from 4.1 to 4.4, which led to losing users running older SDK versions on their devices. These changes led to many user complaints, as shown in Fig. 2.

This example indicates the usefulness of and the need for an automated tool to track bad updates, in order to avoid negatively affecting the user experience and to ensure the success of the app, by learning from previous types of updates (good, bad, or neutral). However, this task is not trivial in practice. In fact, the main difficulty lies in the complex search space, as the number of possible combinations of update features (e.g., changes in the user interface, feature removal or addition, SDK update, release size, changes in the app permissions, changes in the libraries, etc.) and their associated values is very large. Hence, tracking bad updates can be formulated as a search-based optimization problem that explores this large search space in order to find the optimal detection rules for each class. Additionally, a practical tool should provide developers with human-explainable detection rules to help them gain insights into bad changes, especially when these changes are not trivial, as shown in this example. For instance, one possible detection rule for the Dubsmash app, as illustrated in Fig. 3, indicates that to avoid a bad update, the decrease in the APK size (i.e., Chang_perc_APK_size) and in the number of intents (i.e., Chang_perc_Nintent) should not exceed 70% and 77%, respectively. Additionally, the number of activities (i.e., Nact) should be more than 52. On the other hand, the minimum SDK version (i.e., Min_SDK) should not exceed 4.4. These conditions could be leveraged as refactoring recommendations that guide developers in the maintenance process in order to maintain the app’s rating.

Fig. 3: An illustrative example of a bad release detection rule for the Dubsmash app
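As a concrete illustration, the rule of Fig. 3 could be encoded as a simple boolean predicate over the release features, as in the following sketch (the feature names follow Table 3; the sign convention for the change percentages, with negative values denoting decreases, is our assumption for illustration):

```python
def is_bad_update(f: dict) -> bool:
    """Illustrative encoding of the Fig. 3 rule for Dubsmash (not the learned rule itself)."""
    return (
        f["Chang_perc_APK_size"] < -70    # APK size decreased by more than 70%
        or f["Chang_perc_Nintent"] < -77  # number of intents decreased by more than 77%
        or f["Nact"] <= 52                # 52 activities or fewer
        or f["Min_SDK"] > 4.4             # minimum SDK version raised above 4.4
    )

# The U43 update discussed above would be flagged:
u43 = {"Chang_perc_APK_size": -71, "Chang_perc_Nintent": -78, "Nact": 31, "Min_SDK": 4.4}
print(is_bad_update(u43))  # True
```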

In the next section, we describe AppTracker and show how we formulated the bad update tracking problem as a multi-objective combinatorial optimization problem to address the above-mentioned challenges.

3 The AppTracker Approach

In this section, we describe our AppTracker approach to automatically track bad mobile apps releases using multi-objective genetic programming (MOGP).

3.1 Approach Overview

Figure 4 illustrates an overview of AppTracker, a two-phase framework consisting of (1) a training phase and (2) a detection phase. In the training phase, our main goal is to decompose the multi-class problem into multiple binary problems in order to build a set of binary detection rules from real-world examples of various Android app releases. In the detection phase, we use the generated rules to detect the appropriate label (neutral, good or bad) for new unlabeled data (i.e., a new release).

Fig. 4: Approach overview

As shown in Fig. 4, our framework takes as input a set of mobile app releases with known labels, i.e., “bad”, “good” or “neutral” (Step A). Step B then consists of extracting a set of features characterizing the considered releases in order to feed the search-based algorithm based on multi-objective genetic programming (MOGP). As output, a set of non-dominated rules (i.e., solutions for which no other solution in the set achieves a better score across all objectives) is built (Step C). Thereafter, in the detection phase, the framework assigns the proper class to a new release based on its characteristics (Step D), using ensemble majority voting based on each rule’s score. In the next subsections, we detail each step.

3.2 Step A: Training Data Preparation: Collecting Apps Data

Our data collection follows a three-step process. First, we collected app update data (e.g., the APK files of the releases) of popular free Android apps in the Google Play Store. Then, we extracted app manifest information. Finally, we collected data about the advertisement (Ads) libraries that are used in each app.

3.2.1 Collecting Updates of the Google Play Store Apps

To collect Google Play Store apps, we proceeded as follows:

A. Selecting Top Free-to-Download Apps

In this study, we focused on free-to-download apps of the Google Play Store (Noei et al. 2017). In particular, we selected a set of mobile apps with respect to the following criteria:

  • App popularity: We considered popular Android apps in the Google Play Store, as we expect that these apps are developed and maintained by developers who care about their apps’ rating, and that they have a large user base.

  • App diversity: We considered the top popular Android apps across all categories in the Google Play Store to ensure that there is no bias towards specific app categories in our observations.

Our selection of the top free-to-download apps is based on App Annie’s report on popular apps (AppAnnie 2020) in the Google Play Store since 2016. We selected the top hundred apps in each app category so that our study is not impacted by the variances across the different app categories. Next, we filtered out apps that were repeated across categories or that were removed from the Google Play Store during our study period. In total, we selected 1,717 apps with over 50,700 releases during our study period. Table 2 provides some statistics about the studied apps.

B. Crawling App Data Over Three Years

We used a Google Play crawler (Akdeniz 2013) to gather longitudinal data during the period 20 April 2016 to 20 September 2019 from the Google Play Store. Thereafter, for each studied app, we collected the following data:

  • General data: The app title, description, current number of downloads, and rating.

  • Updates data: The release notes of each update.

  • User reviews data: The review title, review contents, rating, and review time.

At the end of this step, we collected a total of 50,700 updates that were released during our study period. Table 2 summarizes the statistics about the collected updates for each category.

3.2.2 Extracting App Manifest Information

To collect the metadata of an app, its components, and its requirements, we need to extract the app manifest file (i.e., AndroidManifest.xml) from the APK file of the app. We reverse engineered the APKs of each app update using ApktoolFootnote 3 and extracted the AndroidManifest.xml files from the collected APKs. Then, we parsed each AndroidManifest.xml and collected the app metadata (e.g., permissions, activities, services, and target SDK versions).

3.2.3 Collecting Data About Integrated Ad Libraries

To collect integrated advertisement (Ad) libraries, we followed Ahasanuzzaman et al.’s technique (Ahasanuzzaman et al. 2020; Ahasanuzzaman et al. 2020). In particular, we extracted the fully qualified name of each class and manually searched it on the web to identify ad library packages. Thereafter, we collected the list of integrated ad libraries in each update of the studied apps.

3.3 Characterizing the Studied Updates

We follow an approach similar to that of Hassan et al. (2018) to characterize the updates (e.g., good or bad updates) of an app based on the app’s user ratings. First, we calculate the Ratio of Negative Ratings RNR(Ui) of an update Ui of an app as the ratio of one- or two-star ratings of the update Ui to the total number of ratings of that update. Then, we calculate the Median Ratio of Negative Ratings MRNR(Ui) of an update Ui, which is the median of the Ratio of Negative Ratings over all the previous updates of Ui. Finally, to characterize an update Ui, we measure the Negativity Ratio (NR) of Ui based on RNR(Ui) and MRNR(Ui) as follows:

$$ \textit{Negativity Ratio}(U_{i}) = \frac{RNR(U_{i})}{MRNR(U_{i})} \qquad (1) $$

For instance, consider an update with 10 user ratings (four ratings with two stars and six ratings with four stars); the RNR score of this update is \(RNR = \frac{4}{10} = 0.4\). If the MRNR of this update is 0.1, then its negativity ratio is \(NR = \frac{0.4}{0.1} = 4\).

We characterize an update of an app into one of three classes using the negativity ratio. Table 1 shows the characterization rules for an update (Hassan et al. 2018). These classes are the target labels of the updates in our dataset.
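For illustration, the sketch below computes the RNR, MRNR and negativity ratio exactly as defined above; the class thresholds used in label_update are placeholders only, since the actual characterization rules are those of Table 1 (Hassan et al. 2018):

```python
import statistics

def rnr(ratings):
    """Ratio of Negative Ratings: share of 1- and 2-star ratings of a single update."""
    return sum(1 for r in ratings if r <= 2) / len(ratings)

def negativity_ratio(current_ratings, previous_updates_ratings):
    """NR(Ui) = RNR(Ui) / MRNR(Ui), as in Eq. (1)."""
    mrnr = statistics.median(rnr(r) for r in previous_updates_ratings)
    return rnr(current_ratings) / mrnr

def label_update(nr, bad_threshold=1.5, good_threshold=0.5):
    # Placeholder thresholds for illustration; the real cut-offs are those of Table 1.
    if nr >= bad_threshold:
        return "bad"
    if nr <= good_threshold:
        return "good"
    return "neutral"

# Worked example from the text: RNR = 4/10 = 0.4, MRNR = 0.1, hence NR = 4
print(label_update(0.4 / 0.1))  # "bad" under the placeholder thresholds
```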

Table 1 Characterization of the updates of an app

Table 2 shows the number of good, bad and neutral updates across the studied categories.

Table 2 Summary of the collected data

3.4 Step B: Features Extraction

In our approach, we extracted a total of 41 metrics divided into ten dimensions that characterize an update’s rating and thus its likelihood of being labeled as “bad”, “good” or “neutral”. In Table 3, we list our metric suite and explain the rationale behind each metric. In particular, the metrics cover the following dimensions:

  • Size of the app: consists of metrics related to the APK size of an app at the time of release. Larger apps typically provide more features but, at the same time, require more space and bandwidth to download the update, which could impact the app’s rating. We also collect the number of activities and services in each app release. An activity provides a screen for users to interact with, whereas a service is used to perform operations in the background. We also consider the app intents, which define the app’s “intent” to perform an action.

  • Ad libraries: This dimension captures any changes in the number of displayed advertisements (ads). It has been shown that the frequency and size of displayed ads increase the number of negative reviews (Ahasanuzzaman et al. 2020; Gui et al. 2017).

  • SDK version: This dimension includes metrics related to the minimum and target Software Development Kit (SDK) versions. Higher minimum and target SDK versions might suggest that the app includes many new features but, at the same time, can lead to losing users who run older SDK versions on their devices.

  • Permissions: In this dimension, we collect information about the user permissions. A higher number of permissions increases privacy risks, which might impact the release’s rating.

  • Marketing effort: This dimension includes metrics related to the release description (i.e., release note) that is displayed to all users to present the new features or the resolved issues. Many changes in the release notes would signify many feature updates and improvements in the app.

  • Link to last release(s): This dimension is related to the app’s releasing stability over time. Previous release ratings can help in predicting future release ratings.

  • Release time: This dimension is dedicated to measuring the release frequency. On the one hand, frequent updates delivered in quick succession may still leave bugs unsolved or even introduce more issues into the app. On the other hand, frequent updates may resolve issues quickly and thus satisfy users. Hassan et al. (2017) analyzed emergency updates and found that the ratio of negative reviews is small for such updates.

Table 3 AppTracker metrics

3.5 Step C: MOGP-Based Three-Class Classification

To address the three-class classification problem, we divide it into three binary classification problems using the One-Versus-All (OVA) method. For each binary classification instance, one class is labeled as the “positive class” (= 1) and all the other classes as “negative classes” (= 0), and we then train the corresponding classification model. The main merit of this strategy is its interpretability, since it allows gaining valuable knowledge about a given class by inspecting its corresponding model. Additionally, this strategy is commonly used and is usually the default choice for Machine Learning (ML) models to handle multi-class classification problems (Scikit-learn 2006b; Rocha and Goldenstein 2013).
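A minimal sketch of the OVA decomposition is shown below; train_mogp is a placeholder for the NSGA-II-based rule generation described in the next subsections:

```python
import numpy as np

CLASSES = ("bad", "good", "neutral")

def train_one_vs_all(X, y, train_mogp):
    """Train one binary detector per class: the class at hand is the positive
    class (1) and the two remaining classes form the negative class (0)."""
    rules = {}
    for cls in CLASSES:
        binary_y = np.where(np.asarray(y) == cls, 1, 0)
        rules[cls] = train_mogp(X, binary_y)  # placeholder base learner
    return rules
```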

3.5.1 Overview of NSGA-II

In this paper, we use NSGA-II, an intelligent search-based algorithm that has been widely adopted to solve many software engineering problems (Harman et al. 2012; Harman et al. 2010; Saidani et al. 2020; Mkaouer et al. 2015; Ouni et al. 2016; Ouni 2020; Saidani et al. 2021; Ouni et al. 2012), to generate the binary detection rules of each release class.

NSGA-II starts by randomly creating an initial population of individuals encoded using a specific representation. Then, a child population is generated from the parent population using genetic operators (crossover and mutation). The whole population (containing children and parents) is sorted according to dominance level (Deb et al. 2002), and only the best N solutions are kept (N is the population size, a parameter to be set). Then, a new population is created using selection, crossover and mutation. This process is repeated until a stopping criterion is reached.

3.5.2 Adaptation of NSGA-II for Binary Classification

To adapt a search algorithm to a given problem, several elements need to be defined. In fact, it is insufficient to merely apply a search technique out of the box; problem-specific adaptations need to be defined to ensure the best performance, namely (i) the solution representation, (ii) the solution evolution, and (iii) the solution evaluation.

Solution representation

In MOGP, a candidate solution, i.e., a detection rule, is represented as an IF–THEN rule with the following structure (Saidani et al. 2020; Ouni et al. 2013; Kessentini and Ouni 2017; Ouni et al. 2015):

IF (combination of metric–threshold conditions) THEN release class

The antecedent of the IF statement describes the conditions, i.e., pairs of metrics and their threshold values connected with comparison operators (e.g., =, >, ≥, <, ≤), under which a release is considered good, bad, or neutral. These pairs are combined using logical operators (OR and AND in our formulation). Figure 5 provides an example of a solution. This rule, represented by a binary tree, detects a bad release if (1) the change in the minimum required SDK version (Min-SDK_chang) equals 1%, or (2) the number of Ad libraries (Nlib) is greater than or equal to 5, or (3) the number of dangerous (security-related) permissions (dang_perm) is greater than or equal to 2.

IF (Min-SDK_chang = 1) OR (Nlib ≥ 5) OR (dang_perm ≥ 2) THEN bad release

Fig. 5: A simplified example of a solution representation
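One possible way to encode and evaluate such tree-based rules is sketched below (a simplification of our actual representation, shown only to make the encoding concrete); leaf nodes hold (metric, operator, threshold) conditions and internal nodes hold the AND/OR operators:

```python
import operator

OPS = {"=": operator.eq, ">": operator.gt, ">=": operator.ge,
       "<": operator.lt, "<=": operator.le}

class Leaf:
    """Terminal node: a single metric/threshold condition."""
    def __init__(self, metric, op, threshold):
        self.metric, self.op, self.threshold = metric, op, threshold
    def evaluate(self, features):
        return OPS[self.op](features[self.metric], self.threshold)

class Node:
    """Internal node: AND/OR combination of two sub-trees."""
    def __init__(self, logic, left, right):
        self.logic, self.left, self.right = logic, left, right
    def evaluate(self, features):
        l, r = self.left.evaluate(features), self.right.evaluate(features)
        return (l and r) if self.logic == "AND" else (l or r)

# The rule of Fig. 5: bad IF Min-SDK_chang = 1 OR Nlib >= 5 OR dang_perm >= 2
rule = Node("OR",
            Leaf("Min-SDK_chang", "=", 1),
            Node("OR", Leaf("Nlib", ">=", 5), Leaf("dang_perm", ">=", 2)))
print(rule.evaluate({"Min-SDK_chang": 0, "Nlib": 6, "dang_perm": 1}))  # True
```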

To generate the initial population, we start by randomly selecting a set of metrics and their threshold values and then assign them to the different nodes of a given individual, i.e., a tree. To control complexity, the size of each solution, i.e., the tree’s length, should vary between lower- and upper-bound limits based on the total number of metrics considered for the detection rule. More precisely, for each solution, we assign:

  • To each leaf node, one metric and its corresponding threshold. The latter is generated randomly between lower and upper bounds according to the value range of the related metric.

  • To each internal node (function), an operator randomly selected between AND and OR.

Genetic operators

We formulated our genetic operators as follows:

Mutation

In MOGP, a mutation can be applied to (i) a terminal or (ii) a function node. First, the mutation operator randomly selects a node in the tree to be mutated. If the selected node is a terminal, it is replaced by another terminal (i.e., another metric, another threshold value, or both); if it is a function node (i.e., an AND or OR operator), it is replaced by a new random function, and the node together with its sub-tree is replaced by a new randomly generated sub-tree. Figure 6 depicts an example of the mutation process, in which we replace the terminal containing the Min-SDK_chang feature by another terminal composed of the condition targ_sdk = 5. Thus, we obtain the following new rule:

IF (targ_sdk = 5) OR (Nlib ≥ 5) OR (dang_perm ≥ 2) THEN bad release

Fig. 6: An example of the mutation operator

Crossover

For MOGP, we use the standard single-point crossover operator, where two parents are selected and a sub-tree is extracted from each of them. Figure 7 depicts an example of the crossover process, in which rules P1 and P2 are combined to generate two new rules, C1 and C2.

Fig. 7: An example of the crossover operator

Solution Evaluation

An appropriate fitness function, also called an objective function, should be defined to evaluate how good a candidate solution is. For our binary classification problem, we seek to optimize the two following objective functions:

  1. Maximize the coverage of correctly detected positive class instances over the actual positive class instances, known as the True Positive Rate (TPR), also called the probability of detection (PD).

    $$ TPR(S) = \frac{|\{\text{Detected positive class instances}\} \cap \{\text{Expected positive class instances}\}|}{|\{\text{Expected positive class instances}\}|} $$
  2. Minimize the coverage of actual negative class instances that are incorrectly classified as positive, known as the False Positive Rate (FPR), also called the probability of false alarm.

    $$ FPR(S) = \frac{|\{\text{Detected positive class instances}\} \cap \{\text{Expected negative class instances}\}|}{|\{\text{Expected negative class instances}\}|} $$

Additionally, since NSGA-II returns a set of optimal (i.e., non-dominated) solutions in the Pareto front without ranking, we extract a single best solution, namely the one nearest to the ideal point in which the TPR value equals 1 and the FPR value equals 0. Formally, the distance is computed as the Euclidean distance (Ouni et al. 2016; Ouni et al. 2013; Saidani et al. 2020) as follows:

$$ BestSol= \min_{i=1}^{n} \sqrt{(1-TPR[i])^{2} + FPR[i]^{2}} $$
(2)

where n represents the cardinality of the Pareto front generated by NSGA-II.
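The sketch below makes these two steps concrete: the two objectives are computed for a candidate rule on a labeled training set, and the solution closest to the ideal point (TPR = 1, FPR = 0) is then selected from the Pareto front according to Eq. (2). It assumes rule objects exposing an evaluate method (as in the encoding sketched earlier) and Pareto-front entries exposing their tpr and fpr scores:

```python
import math

def objectives(rule, releases, labels):
    """Returns (TPR, FPR) of a binary detection rule on a labeled set."""
    tp = fp = fn = tn = 0
    for features, label in zip(releases, labels):
        predicted = rule.evaluate(features)
        if predicted and label == 1:
            tp += 1
        elif predicted and label == 0:
            fp += 1
        elif not predicted and label == 1:
            fn += 1
        else:
            tn += 1
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return tpr, fpr

def best_solution(pareto_front):
    """Eq. (2): pick the rule with the smallest Euclidean distance to (TPR=1, FPR=0)."""
    return min(pareto_front, key=lambda s: math.sqrt((1 - s.tpr) ** 2 + s.fpr ** 2))
```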

3.6 Step D: Detection Phase

After the optimal binary rules are built in the training phase, they are then used to detect the corresponding label for a new app release. This step takes as input the set of features extracted from a given release using the feature extraction module. As output, it returns the label, i.e., good, bad, or neutral, based on the majority voting principle.

3.6.1 Majority Voting

Each detection rule returns (i) either +1 (to indicate that the input belongs to its class) or -1 (to indicate that the input does not belong to its class), and (ii) a confidence level measured by its fitness function value (the average of the two objective function scores). Thus, for each class, we obtain a two-dimensional vector containing the weighted sums (i.e., votes multiplied by the confidence levels) of the positive and negative votes as its entries. Two situations should be taken into consideration. First, in the case of a conflict, i.e., two or more rules return +1, the final label is assigned to the class with the highest confidence. Second, when no rule recognizes the input as its class (i.e., all the rules return -1), we assign the label of the class associated with the most negative confidence level.
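A simplified sketch of this voting step is given below, in which each class’s own rule casts a single weighted vote; it assumes each binary rule exposes a predict method returning +1 or -1 and a confidence attribute holding its fitness-based score (an assumed interface, used only for illustration):

```python
def vote(rules, features):
    """Assign a label to a new release from the three binary detectors."""
    # Weighted vote of each class's rule: +1/-1 multiplied by its confidence level.
    scores = {cls: rule.predict(features) * rule.confidence for cls, rule in rules.items()}
    positives = {cls: s for cls, s in scores.items() if s > 0}
    if positives:
        # One or more rules claimed the input: keep the positive vote with the highest confidence.
        return max(positives, key=positives.get)
    # No rule claimed the input: assign the class with the most negative confidence level.
    return min(scores, key=scores.get)
```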

4 Empirical Study Design

In this section, we describe the design of our empirical study to evaluate the AppTracker approach. Figure 8 provides an overview of our experimental design. First, we evaluate the predictive performance of AppTracker, based on NSGA-II, against mono-objective search and state-of-the-art machine learning algorithms to address the first two research questions. We run the non-deterministic algorithms used in this empirical study 31 times to deal with their stochastic nature, as suggested by Arcuri and Briand (2011). Afterward, we conduct an experiment to qualitatively investigate the most important metrics for our approach. In the following, we describe each step in detail.

Fig. 8: An overview of our experimental design

To facilitate the replication and extension of our study, we provide the experimental material in our online replication package (Dataset for bad releases detection 2021).

4.1 Research Questions

We designed our experiments to answer three research questions (RQs):

  • RQ1 (Within-project evaluation). How does our AppTracker approach perform compared to baseline techniques in within-project scenario?

  • RQ2 (Cross-project evaluation). How effective is our AppTracker approach when applied in cross-project scenario?

  • RQ3 (Features importance analysis). What are the most important features for our tool?

4.2 Predictive Performance (RQ1-2)

The first objective of our experimental study is to assess the effectiveness of our AppTracker approach in solving the three-class classification problem of app releases, considering two different scenarios: within-project (RQ1) and cross-project validation (RQ2).

4.2.1 Evaluation Scenarios and Apps Filtering

In RQ1, we conduct a time-aware validation in which the chronological order of releases is preserved, similar to previous studies (Yan et al. 2020; Qiu et al. 2020; Yan et al. 2020; Huang et al. 2017; Yang et al. 2016). Specifically, we consider time series validation,Footnote 4 a variation of k-fold cross-validation in which the train/test sets are observed at fixed time intervals. In the kth split, time series validation returns the first k folds as the training set and the (k + 1)th fold as the test set. In this study, k is set to 5, the default value. Since this scenario is only useful for apps with sufficient historical data, we consider only apps that had at least 100 release versions. Additionally, we only select apps with at least one representative of each class in both the training and testing sets. This filtering left 19 apps with 2,518 versions. An overview of the studied apps is reported in Table 4.
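This validation corresponds to scikit-learn’s TimeSeriesSplit with its default five splits; the sketch below illustrates the protocol on toy data standing in for one app’s chronologically ordered releases:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

# Toy stand-ins for one app's releases: 120 chronologically ordered updates, 41 metrics.
X = np.random.rand(120, 41)
y = np.random.choice(["bad", "good", "neutral"], size=120)

tscv = TimeSeriesSplit(n_splits=5)  # fold k+1 is always later in time than folds 1..k
for k, (train_idx, test_idx) in enumerate(tscv.split(X), start=1):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]
    # ... train AppTracker or a baseline on (X_train, y_train), evaluate on (X_test, y_test)
```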

Table 4 Statistics about RQ1 corpus

Then, in RQ2, we investigate the extent to which our approach can be generalized through cross-project prediction. In fact, mobile apps might not always have sufficient historical labeled data to build a classifier (Xia et al. 2016), especially small or new apps in the market, which may prevent the mobile app team from using within-project prediction tools such as AppTracker. Hence, cross-project validation is a common state-of-the-art technique to address the lack of training data in software engineering (Xia et al. 2017). To evaluate our approach in the cross-project scenario, for each app we train on the other collected apps from the same category and then test AppTracker on the target app’s data. Training on apps from the same category is useful for developers, as it would help them track the bad updates of their competitors and attempt to avoid them (e.g., privacy violations).

Similar to RQ1, we only study apps with at least one instance of each update class in the training/testing sets, which left 1,313 apps with a total of 48,395 updates, as shown in Table 5.

Table 5 Statistics about RQ2 corpus.

Note that for both RQ1 and RQ2, all the studied approaches are evaluated on unseen data (i.e. the testing data is not used at the training phase).

4.2.2 Baseline Approaches

As a basis for comparison with our MOGP method, we employed representative families of classifiers: a GP-based approach and common Machine Learning (ML) families that are widely used to solve several software engineering problems. In each algorithm family, we consider two approaches: discretization-based classifiers, where the instances are directly classified into one of the three classes, and regression-based classifiers, which first build a regression model on the negativity ratio and then perform classification according to Table 1. The considered baselines are presented in Table 6.

Table 6 Selected baselines from each family

Furthermore, as ML models are sensitive to the scale of the inputs, the data are normalized in the range [0,1] using feature scaling. In addition, to mitigate the imbalanced nature of the dataset, we rely on the Synthetic Minority Oversampling Technique (SMOTE) (Chawla et al. 2002) to resample the training data. Note that with XGB there is no need for resampling, as it is handled internally by the algorithm, similarly to our approach. Also, it is worth mentioning that we only resample the training data in order to assess these algorithms in a real-world situation.
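The pre-processing of the ML baselines can be sketched as follows with the scikit-learn and imbalanced-learn APIs (a sketch on toy data, not our exact scripts):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from imblearn.over_sampling import SMOTE

# Toy stand-ins for a train/test split (see the validation sketch in Section 4.2.1).
X_train, y_train = np.random.rand(100, 41), np.random.choice(["bad", "good", "neutral"], 100)
X_test = np.random.rand(20, 41)

# Scale features to [0, 1]; the scaler is fit on the training data only.
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# SMOTE is applied to the training data only, so the test set keeps its real class distribution.
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train_scaled, y_train)
```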

The ML and XGB models are implemented using the Scikit-learn (Scikit-learn 2006a) and XGB (XGBoost 2006) Python libraries, respectively. As for the search-based algorithms, we used the MOEA Framework,Footnote 5 an open-source framework for developing and experimenting with search-based algorithms (Hadka).

4.2.3 Evaluation Metrics

To compare the predictive performance of AppTracker with that of the other techniques, we employ, for binary classification, the F1-score, a metric commonly used in predictive model comparison (Hastie et al. 2009), defined as the harmonic mean of the precision and recall of the prediction. Precision measures the ability of a classifier not to label as positive a sample that is negative, while Recall measures the ability of a classifier to find all the positive samples. We also use the Area Under the ROC Curve (AUC), which indicates how well a prediction model/rule is able to distinguish between positive and negative classes. In our study, we consider the following binary measures:

  • True Positive (TP): the number of positive class instances that are correctly classified;

  • True Negative (TN): the number of negative instances that are correctly classified as negative;

  • False Positive (FP): the number of negative instances classified as positive;

  • False Negative (FN): the number of positive instances that are incorrectly classified as negative;

  • n, m and p represent the number of instances of the bad, good and neutral release classes, respectively.

For multi-class classification, we consider the Matthews Correlation Coefficient (MCC) (Chicco and Jurman 2020), computed as a correlation coefficient between the observed and predicted classifications. Additionally, we calculate the Standard (also called macro) averages of the binary metrics, as done in previous studies (Sokolova and Lapalme 2009; Branco et al. 2017; Hossin and Sulaiman 2015), and the Weighted averages (i.e., weighted by the number of instances per class) in order to account for class imbalance (Evans et al. 2019; Eberius et al. 2015; Hassan et al. 2020). All the used measures are defined in Table 7.
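With scikit-learn, the standard (macro) and weighted averages as well as MCC can be computed as follows (a sketch on toy labels; in our experiments y_true and y_pred are the actual and predicted classes of a test set):

```python
from sklearn.metrics import f1_score, matthews_corrcoef, precision_score, recall_score

y_true = ["bad", "good", "neutral", "good", "bad", "neutral"]   # toy ground truth
y_pred = ["bad", "good", "bad", "good", "neutral", "neutral"]   # toy predictions

standard_f1 = f1_score(y_true, y_pred, average="macro")       # unweighted mean over the 3 classes
weighted_f1 = f1_score(y_true, y_pred, average="weighted")    # weighted by class support (n, m, p)
standard_precision = precision_score(y_true, y_pred, average="macro")
weighted_recall = recall_score(y_true, y_pred, average="weighted")
mcc = matthews_corrcoef(y_true, y_pred)                       # multi-class MCC
```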

Table 7 Performance measures

4.2.4 Dealing with Stochastic Approaches

Due to the stochastic nature of genetic algorithms and of the decision tree (DT) and random forest (RF) algorithms, we compare their performance by performing 31 independent runs for each experiment and then choosing the rule/model with the median value, as suggested by Arcuri and Briand (2011).

4.2.5 Statistical Tests Methods Used

Before selecting the statistical tests, we first assess the normality of the data. To this end, we employ the Shapiro-Wilk W test (Royston 1992) to assess whether the data distribution is normal (i.e., p-value ≥ 0.05). Using this test, we found that the p-value is < 0.05 for all the used metrics, suggesting that a non-parametric test should be used.

To support the conclusions derived from the obtained results, we use the Wilcoxon signed rank test (Wilcoxon et al. 1970) with a 95% confidence level, together with the Bonferroni correction (Armstrong 2014). The Vargha-Delaney A (VDA) measure (Vargha and Delaney 2000) is also used to measure the effect size. This non-parametric method is widely recommended in the SBSE context (Nejati and Gay 2019) and indicates the probability that one technique will outperform another on a given performance measure. When comparing the performance of two techniques, a Vargha-Delaney A measure equal to 0.5 indicates that the two techniques have comparable performance (i.e., do not differ), while a measure above or below 0.5 indicates that one of the techniques outperforms the other (Thomas et al. 2014). The Vargha-Delaney statistic also classifies the magnitude of the obtained effect size into four different levels: (i) negligible, (ii) small, (iii) medium, and (iv) large (Scalabrino et al. 2016).
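For reference, a comparison between two sets of paired scores (e.g., the F1 scores of AppTracker and of a baseline over the same runs) could be sketched as follows; SciPy provides the Wilcoxon signed rank test, while the Vargha-Delaney A measure is implemented directly since it is not part of SciPy:

```python
from scipy.stats import wilcoxon

def vargha_delaney_a(xs, ys):
    """A12 statistic: probability that a score from xs exceeds a score from ys."""
    greater = sum(1 for x in xs for y in ys if x > y)
    equal = sum(1 for x in xs for y in ys if x == y)
    return (greater + 0.5 * equal) / (len(xs) * len(ys))

apptracker_f1 = [0.52, 0.55, 0.49, 0.58, 0.51, 0.54]  # toy paired scores
baseline_f1   = [0.41, 0.44, 0.40, 0.47, 0.43, 0.42]

stat, p_value = wilcoxon(apptracker_f1, baseline_f1)  # paired, non-parametric
a12 = vargha_delaney_a(apptracker_f1, baseline_f1)    # > 0.5 favours AppTracker
# p_value is then compared against the Bonferroni-corrected significance level.
```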

4.2.6 Parameters’ Tuning and Setting

One of the most important aspects of research on prediction approaches is parameter tuning, which has a critical impact on an algorithm’s performance (Arcuri and Fraser 2011). This is also essential when using ML techniques (Tantithamthavorn et al. 2018b). Since there is no parameter setting that is optimal for all problems, we used a trial-and-error method to select the hyper-parameters of the search-based algorithms, which is a common practice in SBSE (Harman et al. 2012). These parameters are fixed as follows: population size = 100; maximum number of generations = 500; crossover probability = 0.7; and mutation probability = 0.1.

As for the ML techniques, we employed Grid Search (GS) (Scikit-learn.org 2006), an exhaustive search-based tuning method widely used in practice. In order to facilitate the replication of our results, we provide the main selected parameters and their respective search spaces for the ML techniques in Table 8. Please note that parameter tuning is only applied to the training set and hence we cannot guarantee an optimal result on the testing set, as parameter tuning may lead to over-fitting (Tantithamthavorn et al. 2018a).
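As an illustration of the tuning procedure, the following sketch runs a grid search for one baseline (a random forest) over a reduced, illustrative grid, reusing the resampled training set (X_train_res, y_train_res) from the pre-processing sketch above; the actual search spaces are those of Table 8:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Reduced, illustrative grid; the full configuration spaces are listed in Table 8.
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [None, 5, 10]}

search = GridSearchCV(RandomForestClassifier(random_state=42),
                      param_grid, scoring="f1_macro", cv=5)
search.fit(X_train_res, y_train_res)   # tuning is applied to the training set only
best_rf = search.best_estimator_
```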

Table 8 Configuration space for the hyper-parameters of ML models

4.3 Features’ Importance Analysis (RQ3)

The second goal of our empirical study is to analyze the most important features. This analysis provides actionable insights for (1) practitioners who want to identify the factors that can help them maintain or improve the rating of their apps, and (2) researchers who are interested in understanding which features are influential in mobile app releasing activities, and how.

To address RQ3, we use the Permutation Feature Importance (PFI) technique, introduced by Breiman (2001) and Fisher et al. (2019), to discover which features are the most useful for prediction. The importance of a given feature is computed as the degree of change in the prediction performance in terms of the Gini measure (defined as 2 * AUC - 1). Since the dataset may contain multicollinear features, on which permutation importance can perform poorly, we handle multicollinearity by performing hierarchical clustering on the Spearman rank-order correlations (Zar 2005) and keeping only one feature from each cluster. Once the PFI is computed, we rank the features using the Scott-Knott algorithm (Tantithamthavorn et al. 2017; 2018a) into statistically homogeneous groups, so that the rankings within the same group are not significantly different (i.e., p-value ≥ 0.05). The Scott-Knott algorithm has been widely applied in different software engineering domains, such as identifying the most influential variables (Kabinna et al. 2018; Li et al. 2017; Tian et al. 2015; Tantithamthavorn et al. 2015). It should be noted that we use the non-parametric version of the Scott-Knott algorithm, which does not require the assumption of a normal distribution.
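A sketch of this analysis pipeline with scikit-learn and SciPy is given below; it illustrates the steps (Spearman correlation clustering, keeping one feature per cluster, permutation importance) on a random forest surrogate fitted to toy data, whereas in our study the PFI is computed for the rules generated by AppTracker, and the Scott-Knott grouping is performed separately:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.stats import spearmanr
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

def select_representatives(X, threshold=0.7):
    """Cluster multicollinear features on Spearman correlation; keep one per cluster."""
    corr, _ = spearmanr(X)
    distance = 1 - np.abs(corr)                              # highly correlated -> small distance
    condensed = distance[np.triu_indices_from(distance, k=1)]
    clusters = fcluster(linkage(condensed, method="average"),
                        t=1 - threshold, criterion="distance")
    return [int(np.where(clusters == c)[0][0]) for c in np.unique(clusters)]

# Toy stand-ins for a labeled train/test split over the 41 metrics.
X_train = np.random.rand(200, 41)
y_train = np.random.choice(["bad", "good", "neutral"], 200)
X_test, y_test = np.random.rand(50, 41), np.random.choice(["bad", "good", "neutral"], 50)

kept = select_representatives(X_train)
model = RandomForestClassifier(random_state=42).fit(X_train[:, kept], y_train)
result = permutation_importance(model, X_test[:, kept], y_test,
                                scoring="roc_auc_ovr", n_repeats=10, random_state=42)
# result.importances_mean holds the mean drop in AUC when each kept feature is permuted;
# since Gini = 2 * AUC - 1, the resulting ranking is the same as with the Gini measure.
```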

5 Empirical Study Results

5.1 Results of RQ1 (Within-Project Validation)

Table 9 reports the median F1, AUC and MCC scores achieved by AppTracker compared to the baseline approaches, while Table 10 shows the statistical comparison using the Wilcoxon signed rank test and the Vargha-Delaney A estimate and effect size. In addition, we show the distributions of the studied scores in Fig. 9.

Table 9 Performance of AppTracker vs. the state-of-the-art within-project validation (Median scores among the studied apps in percentage)
Table 10 Statistical tests results of AppTracker compared to ML techniques (within-project)
Fig. 9: Boxplots comparing the scores of AppTracker against the baseline approaches (within-project validation)

As shown in Table 9, our approach achieved satisfactory results for the standard and weighted measures, reaching up to 81% and 90% in terms of standard and weighted F1 measures, respectively (cf. Fig. 10). More specifically, we obtained median scores of 61% for Weighted-F1, 66% for Weighted-Precision, 62% for Weighted-Recall and 67% for Weighted-AUC. With regard to the standard scores, we obtained 52% for Standard-F1, 58% for Standard-Precision, 62% for Standard-Recall and 68% for Standard-AUC. These results are well above 1/3 (33.33%), which is the random chance of guessing that an update belongs to one of the three class labels (i.e., in a three-class classification problem). To gain more insights, we investigated the performance of each binary classification. As Fig. 10 shows, the binary classification of bad updates performs better than the others, reaching median scores of 57% for F1bad, 60% for Precisionbad, 56% for Recallbad and 68% for AUCbad. However, the statistical tests reveal that the scores are comparable across the three classes, and this applies to all the studied metrics (i.e., F1, Precision, Recall and AUC).

Fig. 10: Results of AppTracker (within-project validation)

In comparison with the mono-objective formulation, we clearly see that our MOGP technique outperforms mono-GP with a substantial improvement for all the studied metrics. For example, we achieved an improvement of 15% and 16% for the Standard and Weighted F1 measures, respectively. Moreover, the statistical test results (Table 10) reveal that over the 2,945 runs (5 validation folds x 19 apps x 31 repetitions), the difference in scores is significant with large VDA effect sizes. These findings confirm that the multi-objective formulation is adequate for this problem compared to aggregating the objectives into a single fitness function. Hence, our problem formulation passes the “sanity check” in this RQ.

Compared to the ML techniques, we find that our AppTracker approach is advantageous over all the studied techniques. For instance, AppTracker provides an improvement of at least 24% in terms of MCC over the best ML algorithm (LR). Additionally, the statistical analysis underlines significant differences with large VDA effect sizes (cf. Table 10). Overall, the results reveal that AppTracker reaches the best balance between the three class accuracies. It is worth noting that all ML techniques are trained using re-sampled training sets, unlike NSGA-II, which uses the original data without sampling. These results confirm that the multi-objective formulation is efficient in addressing the data imbalance problem (Bhowan et al. 2010; Saidani et al. 2020).

Finally, it is worth noting that the regression-based classifiers perform worse than the other ML techniques (the discretization-based classifiers used in the study) as well as AppTracker. We also performed statistical tests between AppTracker and the other ML approaches and observe that AppTracker statistically outperforms them (with a large effect size in the majority of cases). Tables 10 and 12 present the statistical test results for the within-project and cross-project scenarios, respectively. These results indicate that discretization-based classification is more adequate for the three-class classification of mobile releases.


5.2 Results of RQ2 (Cross-Project Validation)

In this RQ, we compare AppTracker with the examined baseline approaches under cross-project validation, using our evaluation metrics, i.e., the standard and weighted average scores of the F1-score and AUC, as well as MCC, to measure the performance of our approach. Table 11 presents the effectiveness of cross-project modeling compared to the baseline techniques, while Table 12 reports the statistical test results. In addition, we show the distributions of the studied scores in Fig. 12.

Table 11 Performance of AppTracker vs. the state-of-the-art for cross-project validation (Median scores among the studied apps in percentage)
Table 12 Statistical tests results of AppTracker compared to ML techniques for the cross-project scenario

First, the standard and weighted F1-scores obtained by AppTracker are acceptable, achieving median scores of 47% and 56%, respectively, and reaching up to 90% (cf. Fig. 11). Regarding the binary classifications, Fig. 11 shows that the scores obtained for the “good” class are generally better, which is in line with the statistical test results. Thus, we believe that further research is needed to improve the prediction of the “neutral” and “bad” update classes (Fig. 12).

Fig. 11: Results of AppTracker (cross-project validation)

Fig. 12: Boxplots comparing the scores of AppTracker against the baseline approaches (cross-project validation)

Compared to the baseline approaches, we clearly see that, similar to RQ1, AppTracker remains the best approach. For instance, AppTracker achieves a 9% improvement in terms of MCC over SVC, the best ML technique, and a 17% improvement compared to mono-GP. Moreover, the statistical analysis confirms that all results are significantly different, with small to large effect sizes, as reported in Table 12.

Compared to the within-project validation (RQ1), the results of our approach decrease by 9% in terms of MCC and by 3-5% in terms of AUC and F1 scores, but with negligible (for Standard-F1) to small effect sizes. Overall, we believe that AppTracker is still a promising solution that mitigates the lack of data, especially for new mobile apps without enough release history, and outperforms the state-of-the-art approaches.


5.3 Results of RQ3 (Feature Importance Analysis)

While in the previous RQs we investigated the predictive performance of AppTracker, in this stage we are interested in understanding how important each feature is for the generated rules, as this would be helpful to prioritize the refactoring efforts during the maintenance process. To this end, we apply the Permutation Feature Importance (PFI) technique and then cluster the results using the Scott-Knott test. In the following, we report the results of the feature importance analysis under the within-project and cross-project validations. For the sake of readability, we report only the top-5 metrics (in terms of their importance scores). For more details, please refer to our replication package (Dataset for bad releases detection 2021).

5.3.1 Within-Project Results

Table 13 shows the top-5 metrics ranked and grouped by their importance scores, as determined by the Scott-Knott test.

Table 13 The ranking of the top-5 features for the within-project scenario, divided into distinct groups that have a statistically significant difference in the mean

Link to the last update

The results show that the median percentage of negative ratings of the previous update (last_perc_neg_rating) is the most important feature for our approach, with an average score of 9%. A closer examination reveals that this feature achieves the highest score in 6 out of 19 apps. For example, in com.lionmobi.powerclean, 85% of bad updates have last_perc_neg_rating ≥ 3.9%. In this app, removing this feature would result in a decrease of 13% in the prediction accuracy of AppTracker. A similar observation applies to the com.google.android.youtube app, in which eliminating last_perc_neg_rating would result in a decrease of 20% in the prediction accuracy. This can be explained by the fact that an app may go through unstable periods in which users continue expressing complaints about an issue from the previous update. Hence, our findings comply with prior work showing that developers may need to perform changes through multiple updates until they recover from a bad update (Hassan et al. 2018).

Release Size

The installation size of an app (APK_size) is the second most important feature across the studied apps, with an average score of 7.3%, and is the most important feature for one app, namely air.com.playtika.slotomania, in which this feature obtained an importance score of 17%. Furthermore, the percentage of change in the installation size (chang_perc_APK_size) is the top-3 feature, but with no statistical difference compared to APK_size according to the Scott-Knott test results. Additionally, chang_perc_APK_size is the top-1 feature for two out of 19 apps, which indicates that the change in the size of an app at the time of release could affect the current rating. For instance, we found that in the com.emoji.coolkeyboard app, 56% of bad updates have last_perc_neg_rating ≥ 2%, which indicates that a larger volume of code implies a higher probability of containing a bug (Tian et al. 2015) and thus may lead to user dissatisfaction.

Release time

release_time and delay_last_release (G3) also help in discriminating the updates. While release_time has an average importance score of 6%, delay_last_release obtained 5% and appears at the top of the most important features for one app, namely com.emoji.ikeyboard. Eliminating the delay_last_release feature in this app can lead to a decrease of 10% in the prediction accuracy of AppTracker. Additionally, a manual investigation revealed that all the bad updates in this app have delay_last_release ≤ 31 days, which suggests that a faster release pace can introduce more bugs and thus lead to negative ratings. We also advocate that developers may need to employ proper testing tools to assure the quality of their quickly deployed releases.

5.3.2 Cross-Project Results

The PFI analysis results under the cross-project scenario are displayed in Table 14.

Table 14 The ranking of the top-5 features for the cross-project scenario, divided into distinct groups that have a statistically significant difference in the mean

The APK size

This dimension appears again in the top-3 list of most important features, with two factors (APK_size and chang_perc_APK_size) whose scores are comparable according to the Scott-Knott ESD test (i.e., they are clustered in the same group G1). While APK_size appears as the top-1 feature for 278 apps, chang_perc_APK_size is the top-1 feature for 183 out of 1,313 apps. Hence, developers can consider optimizing their code size and complexity as a means to fix or avoid update issues.

Library

The number of integrated libraries (Nlib) is the third most important feature, with an average score of 6.1%, and is the top-1 feature for 99 apps. By examining our generated rules for bad updates, we found that Nlib is usually associated with the ≥ operator. This result is in line with the findings of Ahasanuzzaman et al. (2020) and Gui et al. (2017), who showed that the frequency and size of displayed ads increase the number of negative reviews.

SDK

The SDK dimension also appears helpful to differentiate the updates under the cross-project scenario. In fact, the minimum SDK (min_SDK) is the fourth most important feature, with an average score of 5.7%, and is the top-1 feature for 47 apps. This finding is in line with a previous study by Tian et al. (2015), in which the authors found that high-rated apps have higher minimum and target SDK versions, as users benefit from the latest features provided by the SDK.

Link to previous updates

The results in Table 14 clearly indicate that the median aggregated rating of all previous updates (hist_rating) is among the most important features for all the studied apps, with an average score of 5.1%. We also found this feature to be dominant in 83 out of 1,313 apps, which strengthens our previous finding that the previous updates’ ratings highly affect the label of the current update. In line with our motivating example in Section 2, this finding indicates that it is indeed hard to keep users’ confidence once a bad release occurs. That is, regaining users’ satisfaction may take time.


6 Discussion and Implications

In this section, we discuss the implications of our results in practice.

Supporting Mobile App Developers in Tracking Bad Updates

The usefulness of our AppTracker approach has been shown through its performance in both the within-project and cross-project validations. Nevertheless, we believe that the key innovation of our approach is its ability to provide the user with a comprehensible justification for the classification, especially when the changes made in the release are non-trivial. Moreover, it is worth noting that, thanks to the flexibility of MOGP techniques, it is possible to reduce the complexity of the generated detection rules (e.g., tree size and/or depth) in order to generate more comprehensible justifications by considering this as an objective in the fitness function (or as a constraint in the solution encoding), but at the cost of sacrificing accuracy, as these objectives are in conflict (Saidani et al. 2020).

Android Developers Need to Pay Attention to the Quality of their App's Next Release

Our results indicate that the history of previous negative ratings (i.e., the hist_perc_neg_rating and last_perc_neg_rating features) is among the most important features. Hence, if an app loses its reputation through repeated bad releases, it will be hard to regain it in the future. Time constraints often push mobile app developers to release faster; however, they should consider the trade-off between time and quality. That is, given that the mobile app market evolves quickly with many competitors, developers should pay special attention to their updates and should maintain their reputation over time.

The Smaller the Release, the Smaller the Risk of Releasing

Our results on the most important features indicate that the change in the release size (APK_size) is among the most influential features. While users typically see new features, improvements, and bug fixes released regularly as a sign of evolution, there is a dilemma here. Moving features around and changing behavior can be confusing and can harm the app’s user experience, so it is important to manage how new changes are released. Releasing little and often is a good way to go, as small releases are less risky. For instance, suppose a developer releases ten features at once; the risk of shipping a bug is high. In the worst scenario, each of the ten released features can have a bug. If this happens, the developer would be in a bad situation, trying to fix ten serious bugs and push an update out as soon as possible. To minimize this risk, releasing smaller and more frequent updates is likely a successful strategy.

Learn Best Practices for the Next Release in Mobile Apps Development

Teaching the next generation of engineers best practices for the release management process and its impact on users is of crucial importance. Educators can use our study results and our dataset (Dataset for bad releases detection 2021) to teach and motivate students to follow best release practices while avoiding bad updates that may cause user dissatisfaction or regressions in their apps. In particular, our real-world dataset of 50,700 updates from 1,717 Android apps represents a valuable resource that could enable the introduction of mobile app release management to students using a “learn by example” methodology, illustrating best release practices that should be followed and bad practices that should be avoided.

Other Formulations for the Problem

Within the evolutionary process, our technique evolves detection rules, mimicking the creation of decision trees, to solve the three-class classification problem. While this paper showed that the tree-based approach can achieve satisfactory results, there is room for improvement. For instance, it would be interesting to explore solving the three-class classification problem without decomposing it into multiple binary classifications.

7 Threats to Validity

In this section, we review the main threats to the validity of our findings:

Threats to Internal Validity

are concerned with the factors that could have affected the validity of our results. The main concern relates to the stochastic nature of search-based algorithms and of some ML techniques (e.g., DT). To address this issue, we repeated each experiment 31 times and considered the median of the score values used to evaluate the predictive performance. Threats to internal validity could also be related to possible errors in our experiments. To conduct our experiments, we used a real-world dataset collected from the Google Play Store, the largest marketplace for mobile applications, and mined user reviews in real time over a period of more than three years using a dedicated tool. Another possible threat to internal validity could be related to bias in the replication of the benchmark approaches. We employed widely used tools and implementations: the MOEA Framework (Hadka) for the search algorithms, and the Scikit-learn (Scikit-learn.org 2006a) and XGBoost (XGBoost 2006) Python libraries for the machine learning algorithms. Thus, we believe that the bias related to internal threats to validity is negligible.
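A minimal sketch of this repetition protocol, assuming a hypothetical run_experiment function that stands in for one stochastic AppTracker or ML run:

import random
import statistics

def run_experiment(seed: int) -> float:
    # Hypothetical placeholder for a single stochastic run returning an F1-score.
    random.seed(seed)
    return 0.70 + random.uniform(-0.05, 0.05)

# Repeat the experiment 31 times and report the median score, as described above.
scores = [run_experiment(seed) for seed in range(31)]
print("median F1 over 31 runs:", round(statistics.median(scores), 3))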

Threats to Construct Validity

are mainly related to the rigor of the study design. First, we relied on standard performance metrics, namely the F1-score, which is widely employed in predictive model comparison (Hastie et al. 2009). Second, although we used learning algorithms from different families, other techniques exist; as future work, we plan to extend our empirical study with other baseline techniques. Another threat to construct validity could be related to parameter tuning, as different parameter settings can lead to different results for search-based as well as ML techniques. We mitigated this issue by applying several trial-and-error iterations to tune the search-based algorithms and by relying on the Grid Search method (Scikit-learn.org 2006) to find the optimal settings of the ML techniques (see the sketch after this paragraph). Thus, future replications of this work should explore other parameter ranges and their impact on the predictive performance. An additional threat is related to the selection of the training and test sets. To mitigate this issue, we considered in RQ1 a time-series validation, which is a realistic scenario as it respects the chronological order of app releases. In RQ2, we selected a typical scenario in which AppTracker is trained on data from the same category (i.e., apps with similar characteristics). Future work is planned to validate our approach considering a time-aware selection in the cross-project setting.
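A hedged example of such Grid Search tuning, using scikit-learn's GridSearchCV; the classifier, parameter grid, and synthetic data are illustrative assumptions and not the exact settings used in the study.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the release features and three-class labels.
X, y = make_classification(n_samples=300, n_classes=3, n_informative=5, random_state=0)

# Illustrative grid; the study's actual parameter ranges may differ.
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [None, 5, 10]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, scoring="f1_macro", cv=5)
search.fit(X, y)
print("best parameters:", search.best_params_)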

Threats to Conclusion Validity

concern the relationship between the treatment and the outcome. To support the conclusions derived from the obtained results, we used the Wilcoxon signed-rank test (Wilcoxon et al. 1970) at a 95% confidence level with Bonferroni correction (Armstrong 2014). We also used the Vargha-Delaney A (VDA) measure (Vargha and Delaney 2000) to assess the effect size; this non-parametric method is widely recommended in the SBSE context (Nejati and Gay 2019). The employed statistical analysis provides strong evidence supporting our assumptions and our experimental study. Hence, we believe that the threat to the validity of our conclusions is negligible.
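For illustration, the sketch below reproduces the outline of this statistical procedure with placeholder score lists (not results from the paper): a Wilcoxon signed-rank test, a Bonferroni-adjusted significance level, and a simple Vargha-Delaney A computation.

from scipy.stats import wilcoxon

def vargha_delaney_a(x, y):
    # A12 effect size: probability that a value drawn from x exceeds one drawn from y,
    # counting ties as 0.5 (non-parametric, as recommended in the SBSE literature).
    greater = sum(1 for a in x for b in y if a > b)
    ties = sum(1 for a in x for b in y if a == b)
    return (greater + 0.5 * ties) / (len(x) * len(y))

# Placeholder F1-scores of two techniques over paired runs (illustrative values only).
apptracker = [0.74, 0.76, 0.73, 0.75, 0.77, 0.74, 0.76]
baseline = [0.70, 0.71, 0.69, 0.72, 0.70, 0.71, 0.69]

stat, p = wilcoxon(apptracker, baseline)
n_comparisons = 8  # e.g., one comparison per baseline; Bonferroni divides alpha by this
alpha = 0.05 / n_comparisons
print(f"p-value={p:.4f}, significant at corrected alpha={alpha:.4f}: {p < alpha}")
print("VDA effect size:", round(vargha_delaney_a(apptracker, baseline), 2))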

Threats to External Validity

are concerned with the generalizability of our results, since the experiments were based on free-to-download Android apps. Hence, future replications of this study are necessary to confirm our findings in other contexts, such as paid Android applications and iOS mobile applications.

8 Related Work

In this section, we review the related work, which can be divided into two categories: (1) the analysis of user feedback in mobile apps, and (2) release engineering in mobile apps.

8.1 Studies on User Review Analysis in Mobile Apps

Several research works have analyzed user reviews in mobile apps to extract knowledge about different mobile app development aspects. Pagano and Maalej (2013) investigated user reviews and found that users tend to provide their feedback shortly after a new app release, while negative feedback (e.g., about shortcomings) is generally destructive. Later, Maalej and Nabil (2015) used various techniques to collect different features from user reviews, then used different ML algorithms to label reviews into four categories: (i) feature request, (ii) bug report, (iii) user experience, and (iv) unspecified. Similarly, Panichella et al. (2015; 2016) proposed an approach named ARdoc (App Reviews Development Oriented Classifier), which classifies user reviews into five categories: (i) feature request, (ii) bug report, (iii) providing information, (iv) requesting information, and (v) others. El Zarif et al. (Zarif et al. 2020) studied users’ feedback and found that users express their intention to switch to competitors when facing issues in the software systems they use. Hence, Assi et al. (2021) proposed FeatCompare, which extracts features from user reviews of competitor apps. The obtained results show that FeatCompare outperforms the existing state-of-the-art approaches by 14.7% on average. They also found that 70% of the surveyed app developers agree on the potential benefits of using FeatCompare to extract features of competitor apps. Hu et al. (2019) studied the consistency of star ratings and reviews of popular free hybrid Android and iOS apps, and found that some hybrid apps do not obtain coherent user ratings across platforms. Sarro et al. (2018) showed that user ratings of Android and BlackBerry apps can be predicted with high accuracy based on the features an app offers.

To investigate the influence of user ratings, Harman et al. (2012) studied over 32k BlackBerry applications and found a high correlation between the average user rating of an app and its number of downloads. Later, Martin et al. (Martin et al. 2016; Martin et al. 2016) found that paid and free app releases tend to have a positive impact on the success of an app, and that free apps with significant releases are more likely to have positive effects on user ratings. They also found that app releases related to bug fixes and new features are more likely to increase user ratings. Moreover, Noei et al. (2017) found that some specific mobile device characteristics (such as the CPU) have a strong relation with the user-perceived quality. Recently, Hassan et al. (2017) investigated emergency updates in Android apps and revealed that these emergency updates are unlikely to be followed by other emergency updates, so they tend to have a long lifetime. The study also revealed that emergency updates are often preceded by updates having more negative user reviews than the emergency updates themselves. Moreover, Khalid et al. (Khalid et al. 2014) investigated user complaints in the reviews of 20 iOS mobile apps and identified 12 types of complaints, most of which were related to functional errors as well as privacy and ethics-related issues. Chen et al. (2021) studied User Interface (UI) issues mentioned in the reviews of 31,579 apps in the Google Play Store and found that UI-related reviews have lower ratings than other reviews; they also identified seventeen issue types (e.g., layout and navigation) related to the UI of mobile apps. Gui et al. (2017) studied various aspects of advertisement (ad) libraries and found that most ad complaints leading to negative reviews were related to user interface concerns, such as the frequency, timing, and location of the displayed ads.

Several studies exploited user reviews and complaints to support various maintenance and evolution activities. For instance, Ciurumelea et al. (2017) studied the textual content of user reviews and leveraged machine learning and information retrieval techniques to plan for the next release. Their approach aims at categorizing reviews and recommending the relevant source code files that should be modified to address the issues described in the user reviews. Palomba et al. (Palomba et al. 2017) introduced ChangeAdvisor, an approach that automatically analyzes user reviews and, using natural language processing and clustering algorithms, recommends the source code artifacts to be changed.

8.2 Studies on Releases Engineering in Mobile Apps

Several research works have focused on studying release practices in mobile apps. Nayebi et al. (2016) performed a survey to study the release strategies adopted for mobile apps and their impact on users. Their study shows that experienced developers are mostly aware that their release strategy affects user reviews and that they are interested in accommodating users’ feedback in their release strategy. From the users’ perspective, the study revealed that while users value apps with frequent updates, frequent updates could also negatively affect users’ opinion about an app. Later, Domínguez-Álvarez and Gorla (2019) studied mobile app releasing practices on both iOS and Android and found that developers release new versions of their apps more frequently on Android than on iOS. They also found that there is no synchronization in releasing apps on the two platforms.

Calciati et al. (Calciati et al. 2018; Calciati and Gorla 2017) studied the evolution of Android apps to investigate how app behavior changes across different releases of the same app. Most of the observed changes are related to increased leaking of sensitive data, an increase in added permissions, and an increase in API calls related to dangerous permissions in later releases over time. Nayebi et al. (2017) built different analogical reasoning models to predict the marketability of Android app releases based on changes in release and code attributes. The obtained results indicate that Android app releases follow certain patterns over time that allow predicting the success of future releases.

Xia et al. (2016) were the first to propose a machine learning technique to predict crashing mobile releases. Using a number of change factors such as complexity, time, and diffusion, they trained a Naive Bayes classifier to predict crashing releases for 10 open source apps. Their results revealed that the technique improves over random guessing by 50% and 28% in terms of F1-score and AUC, respectively. Later, Su et al. (2020) studied crashing releases in open source and commercial Android apps based on thrown exceptions and found that Android framework-related exceptions (e.g., App Management, Database, and Widget) and library exceptions are the main root causes. Yang et al. (Yang et al. 2021) analyzed the release notes of 69,851 releases of 2,232 apps in the Google Play Store and identified six patterns of release notes (e.g., apps with short and rarely updated release notes). The obtained results show that apps with long release notes have higher ratings than other apps; they also found that apps whose release note patterns shifted experienced an increase in their average rating. Recently, Hamdi et al. (Hamdi et al. 2021; Hamdi et al. 2021) conducted a longitudinal study of refactoring activities in Android apps and found that while developers often refactor their apps’ source code, bad coding and design practices are unlikely to be removed through refactoring.

While there are several studies on user reviews and on release practices and issues in mobile apps, there is no approach specifically targeting the prediction of bad releases. With AppTracker, our goal is to track all bad releases, including crashing ones, by leveraging user feedback, as reviews typically report the crashes and issues experienced by end users.

9 Conclusion and Future Work

This paper proposed AppTracker, a novel search-based approach for tracking bad mobile app updates, in which we adapted NSGA-II to generate optimal detection rules for each class (i.e., bad, good, and neutral). The rules have tree-like representations and are evolved to find the best trade-off between two conflicting objective functions: (1) maximizing the true positive rate and (2) minimizing the false positive rate of the binary classification. An empirical study was conducted on a benchmark of 50,700 release updates from 1,717 free Android apps. Considering two validation scenarios, namely cross-validation and cross-project validation, the statistical analysis of the obtained results reveals that AppTracker is advantageous over mono-objective Genetic Programming (mono-GP) and seven Machine Learning (ML) techniques, which confirms the suitability of our formulation for solving the problem. Regarding the analysis of bad updates, we found that (1) the ratings of previous updates and (2) the APK size are the most important features in both within-project and cross-project scenarios.

Our future research agenda includes performing a larger empirical study with apps from other stores, including both free and paid applications. We also plan to consider other metrics, e.g., code-level quality metrics. Furthermore, we plan to extend our AppTracker approach in the form of a bot integrated into the development pipeline of Android apps, in order to notify developers of the risk of their updates before releasing a new version to end users. In addition, while predicting the corresponding class (good, bad, or neutral) helps Android developers follow better release practices and improve user experience, predicting the specific negativity ratio would provide a more fine-grained analysis. Hence, as future work, it would be interesting to build regression-based models to estimate the negativity ratio. Finally, we plan to implement a bot based on AppTracker and conduct a user study with our industrial partner to better evaluate our approach in an industrial setting.