1 Introduction

The widespread diffusion of smartphones and other mobile devices makes the mobile applications market very dynamic and profitable. The quality and, in particular, the reliability of such applications may be key factors that determine their success. The speed at which mobile applications must evolve to maintain their appeal and to adapt to the characteristics of new devices makes testing and quality assurance very important and critical activities.

Manual testing of mobile applications may be very costly in terms of both time and resources, as well as an extremely boring, repetitive, and error-prone activity. For example, there is a large fragmentation of mobile systems and devices, as witnessed by a recent study of OpenSignal that in August 2015 found more than 24,000 different types of devices supporting Android (Footnote 1). A consequent issue is the need to repeat the same tests on a very large number of different devices and execution environments. Not surprisingly, industry has shown a growing interest in mobile testing automation techniques and tools. For example, as regards the Android framework, three technologies supporting testing activities have been developed and distributed since the first versions of the framework in 2008 (i.e., the Monkey tool, which is capable of automatically triggering random event sequences, the MonkeyRunner scripting language, and the InstrumentationTestCase library, by means of which the tester can implement automatically executable test cases), and they have contributed to the diffusion of the Android framework. Moreover, both Google and Amazon have recently released cloud services supporting the automated testing of Android applications (i.e., Android Robo Test from Google (Footnote 2) and the built-in Fuzz Test from Amazon (Footnote 3)).
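To give a concrete flavor of the last of these technologies, the following is a minimal sketch of a legacy instrumentation-based test case of the kind that can be written on top of the InstrumentationTestCase API (the Monkey tool, by contrast, is simply invoked from the command line, e.g., via adb shell monkey, to fire a given number of pseudo-random events). The activity class and widget id used here are hypothetical.

// A minimal sketch of a legacy Android instrumentation test built on the
// InstrumentationTestCase hierarchy (android.test package of the early framework
// versions); MainActivity and R.id.ok_button are hypothetical.
import android.test.ActivityInstrumentationTestCase2;
import android.widget.Button;

public class MainActivityTest extends ActivityInstrumentationTestCase2<MainActivity> {

    public MainActivityTest() {
        super(MainActivity.class);
    }

    // Verifies that the activity under test starts and exposes an enabled button.
    public void testMainActivityStarts() {
        MainActivity activity = getActivity();   // launches the activity under test
        assertNotNull(activity);
        Button ok = (Button) activity.findViewById(R.id.ok_button);
        assertNotNull(ok);
        assertTrue(ok.isEnabled());
    }
}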

A great interest in mobile testing automation has been recorded in the literature, too. The first papers related to testing automation of Symbian applications were published in 2006 (Delamaro et al. 2006), while the first paper related to Android smartphones dates back to 2010 (Liu et al. 2010a) and the first secondary papers discussing challenges, approaches, and future directions of mobile testing automation were published in 2012 and 2013 (Muccini et al. 2012; Amalfitano et al. 2013b; Dubinsky and Abadi 2013; Kirubakaran and Karthikeyani 2013).

Nowadays, there is a large fragmentation of papers focused on different aspects of mobile testing automation, including functional testing, security testing, usability testing, context-awareness testing, and energy efficiency assessment (Zein et al. 2016). This fragmentation, together with the continuous proposal of new techniques and prototype tools, makes it difficult for researchers and practitioners to have a clear view of the state of the art in mobile testing automation.

Systematic mapping studies represent a well-known means to shed light on a wide area of research by systematically classifying all the contributions in the literature with respect to a given set of categories. According to Petersen et al., a software engineering systematic map is a defined method to build a classification scheme and structure a software engineering field of interest (Petersen et al. 2008). Systematic mapping studies differ from systematic literature reviews, which focus on a more qualitative review of the contributions found in the literature. From this point of view, a systematic mapping may represent the starting point for a systematic literature review (Kitchenham et al. 2009).

In this paper, we present a systematic mapping study centered on techniques and tools supporting the automation of functional testing activities on mobile applications.

According to the ISO 29119 Software Testing Standard (2013), testing of the functional characteristics of the software under test is referred to as functional testing, while testing of the other quality characteristics is referred to as non-functional testing. In this study, functional testing denotes all the activities related to the verification of the correct execution of the whole application under test, or of some of its parts, with respect to its functional requirements. It is distinguished from non-functional testing, which is driven by the quality requirements of the application, such as its security, privacy, usability, performance, and energy efficiency.

There are two main reasons behind the selection of this specific research area in this study. The first reason is the relative lack of similar studies in the literature.

In fact, some secondary papers in the literature deal with broader research areas, such as the study of Holl and Elberzhager (2016), which focuses on quality assurance of mobile applications; the one of Zein et al. (2016), which deals with security, privacy, usability, and context-awareness testing (and found only 29 works directly related to testing automation); the one of Sahinoglu et al. (2015), which considers both functional and non-functional testing activities; and the one of Ahmad et al. (2018), which considers development challenges (including testing) related to native mobile applications, mobile web applications, and hybrid applications. Other secondary studies, such as the ones of Corral et al. (2015) and Mendez-Porras et al. (2015b), are updated to 2013 and 2014, respectively. The work of Mendez-Porras et al. (2015b) is the most similar to ours, since its specific topic is mobile testing automation. With respect to this work, we have followed a more accurate methodology and performed a more detailed analysis and classification of the contributions in the literature, also taking into account the papers published in the last 3 years.

The second reason is the fundamental importance of testing automation activities in the context of mobile applications. In fact, even the automation of a single testing task, such as the generation of test cases, their execution, or the oracle evaluation, may produce a remarkable reduction in the costs of the testing process (Crispin and Gregory 2009). Consequently, testing automation may make feasible the execution of complex and effective testing processes that would otherwise be too expensive for manual testing approaches. In addition, the automation of functional testing activities may enable the automation of quality assurance activities, as can be observed in several works in the literature. Automatically generated functional test suites can be reused in the context of compatibility testing of mobile applications with respect to different combinations of devices and operating system versions (Vilkomir and Amstutz 2014; Vilkomir et al. 2015; Zhang et al. 2015a). Another example is the approach proposed by Behrouz et al. (2015), which exploits test cases generated by static and dynamic analysis techniques (including random testing techniques) to explore the execution scenarios of an Android application and to measure its energy consumption. Finally, Canfora et al. (2013) have developed a system to measure some user experience parameters on real devices by exploiting techniques supporting the automatic execution of test cases.

The systematic mapping study presented in this paper has been carried out on the basis of the guidelines proposed by Petersen et al. (2008). First of all, four goals have been formulated, aiming at (1) the classification of the works in the literature in terms of their support to testing automation, (2) the evaluation of the characteristics of the proposed techniques and tools, (3) the classification of the proposed techniques and tools in terms of how they have been evaluated and compared with the state of the art, and (4) the identification, via bibliometric analyses, of the most prolific authors, institutions, and countries, the most influential publications, and the venues and journals that most frequently publish papers related to this topic. By means of the GQM approach (Basili et al. 1994), a number of research questions and metrics related to the proposed goals have been defined.

The literature search has been carried out on seven different search engines with a set of search queries that have been designed by improving the ones proposed by previous secondary works in the literature and validated against a set of relevant papers. The 4509 papers retrieved by the search engines have been filtered on the basis of inclusion and exclusion criteria, obtaining a set of 131 relevant articles. Each of these articles has been analyzed in depth by the authors in order to extract the information necessary to fill the map. The systematic map has been analyzed in order to provide a detailed description of the state of the art in this research field, with its emerging trends and existing gaps. The systematic map is available online at https://goo.gl/678T5P for validation purposes: our intent is to periodically update it in the future.

The remainder of the paper is structured in the following way: Section 2 provides a survey of the secondary works in literature that are related to the mobile testing automation area; the adopted research methodology is described in detail in Section 3, while the results of the systematic mapping study are presented in Section 4. Threats to the validity of the presented study are discussed in Section 5 while further discussions about the way to identify the most relevant contributions found in literature, the emerging trends, and the current research gaps are included in Section 6. Finally, conclusions are reported in Section 7.

2 Related work

In recent years, several works have studied challenges, research directions, and trends in mobile application testing, usually based on informal surveys of the existing literature and tools.

In particular, Muccini et al. (2012) have summarized the main challenges and research directions in mobile testing automation, while Amalfitano et al. (2013b) have provided a wider view of this field in 2013, focusing on open issues from different testing perspectives. Another similar work has been presented in 2013 by Kirubakaran and Karthikeyani (2013), who discussed the challenges addressed and the solutions provided by testing automation techniques. Also in 2013, Dubinsky and Abadi (2013), in the context of the Workshop on Mobile Development Lifecycle, collected from participants a list of 45 challenges that they organized into three research questions for planning future research on mobile testing automation. In 2014, Arzensek and Hericko (2014) and Saad and Awang Abu Bakar (2014) focused their contributions on the criteria for selecting the characteristics of a mobile application testing tool and on the features of existing commercial testing tools, respectively. Gao et al. (2014) have provided a large overview of mobile testing and tools up to 2014. Good surveys of automated testing techniques for mobile applications are also included in the related work sections of some recent papers (Moran K et al. 2016; Mao et al. 2016). Finally, in 2018, Ahmad et al. (2018) have identified the challenges of native, web, and hybrid mobile applications by means of an empirical study and have found that testing is still a relevant challenge for each type of mobile application. A different perspective has been considered by Kochhar et al. (2015) and by Silva et al. (2016), who have studied how mobile application testers try to automate their work and what their needs are.

Recently, secondary works comparing the performance of techniques and tools for mobile testing automation have been published. Choudhary et al. (2015) in 2015 have presented an empirical comparison of the performance of some of the most popular mobile testing automation tools available in the literature, while Amalfitano et al. (2015b, 2017) and Jiang et al. (2017) have presented empirical comparisons focused on different testing techniques implemented in the context of the same testing tool.

Several works have addressed specific topics in the area of mobile testing automation. For example, Harrison et al. (2013) have proposed a literature review specifically focused on mobile usability testing, while Li et al. (2016a) have recently proposed a technical report with a systematic literature review of techniques and tools for static analysis of Android apps.

Several other works have addressed research fields that are wider than the one addressed in our study. Corral et al. (2015) have presented a systematic mapping updated to 2012, focusing their attention on development practices supporting testing and quality assurance of mobile applications. More recently, Holl and Elberzhager (2016) have presented a systematic mapping regarding quality assurance of mobile applications. They have provided answers to seven research questions regarding the approaches found in the literature supporting both testing and quality assurance of mobile applications, including automatic and manual approaches, static and dynamic analysis, and functional and non-functional testing. The study selected 230 papers from Scopus, ScienceDirect, IEEE, and ACM, published up to 2015.

Another recent systematic mapping study is the one presented by Zein et al. (2016). This work is focused on a wide spectrum of testing techniques, including studies on security, privacy, usability, and context-awareness testing, and it retained 79 relevant papers after a very selective process. In particular, the authors have selected a subset of 29 papers related to mobile testing automation. Another similar study is the one of Sahinoglu et al. (2015), which in 2015 presented a systematic mapping including both functional and non-functional mobile testing approaches published up to 2014. It includes 123 papers, and its research questions studied, at a coarse level of detail, the types of testing activities addressed by the selected papers.

With respect to the studies of Corral et al. (2015), Holl and Elberzhager (2016), and Sahinoglu et al. (2015), we have restricted the research field to mobile functional testing automation. In this more focused context, we have designed questions addressing specific aspects related to mobile functional testing automation. With respect to the study of Zein et al. (2016), which considers functional testing automation besides other types of mobile testing and quality assurance activities, we did not restrict the selection to papers providing an empirical evaluation of the proposed techniques and tools: we also included papers providing only a demonstration of the feasibility of the proposed techniques.

The only systematic mapping study related to the specific field of automation of functional testing for mobile applications is the one of Mendez-Porras et al. (2015b). This work addresses, on the basis of a selected subset of 83 papers up to 2014, several general questions, such as bibliometric analyses of authors, journals, and venues, the challenges of automated testing of mobile applications, the proposed techniques and approaches, and the adopted evaluation methods. This work has represented a good starting point for our research. In particular, with respect to it, we have improved the investigation protocol by performing an objective validation of the proposed queries, by widening the search to other search engines such as ACM and Google Scholar, and by performing a more detailed analysis of the existing works on the basis of a richer set of research questions.

3 Research methodology

The research methodology followed in this study is based on the guidelines provided by Petersen et al. (2008, 2015) and Kitchenham and Charters (2007).

The process followed in this paper is composed of six sequential steps. Each step includes a list of tasks that have been executed sequentially and each task has produced some outputs. Figure 1 shows the steps, the tasks, and the main outputs of the process.

Fig. 1 The systematic mapping process

The first step consists of the definition of the research questions that will guide the process: to this aim, we have adopted the Goal/Question/Metric (GQM) paradigm. At the end of this step, a candidate classification scheme has been designed.

In the second step, the strategy for searching relevant papers in literature has been defined. Firstly, a set of sources of evidence has been selected, then a set of queries on these sources has been defined, executed, and validated. The results of the execution of these queries represent the initial set of candidate papers.

In the third step, two sets of inclusion and exclusion criteria have been defined to filter the studies that are relevant to this systematic mapping.

The fourth step is devoted to the screening of papers and to the selection of the ones relevant for the topic addressed by this study. It consists of a preliminary elimination of duplicated studies followed by the Keywording of Abstracts task. In the execution of this task, the title, keywords, and abstract of each paper have been analyzed in order to evaluate which papers can be excluded since they do not satisfy all the inclusion criteria or they satisfy some of the exclusion criteria.

In the Data Extraction and Mapping step, the filtered set of papers has been analyzed in detail by reading each of them. The papers that satisfy the above-defined criteria have been included in the study and the data needed to fill the classification scheme have been collected. The final systematic map has been obtained at the end of this step.

Finally, the metrics evaluated on the systematic map have been analyzed and discussed; these analyses, reported in this paper, represent the output of this last step.

Details about the execution of each step of the systematic mapping process will be reported in the next subsections.

3.1 Definition of the research questions

This study builds a classification scheme of the works in literature related to the proposal and the evaluation of techniques and tools supporting the automation of functional testing activities in the context of mobile applications.

In order to express this objective in terms of research questions and to link them to metrics, the Goal/Question/Metric (GQM) paradigm originally proposed by Basili et al. (1994) has been applied. This methodology has been used both to formulate the research questions and to define the metrics needed to map the studies. A detailed description of the proposed goals, questions, and metrics follows.

3.1.1 Goals

The four goals of this study are:

  • G1   To classify the articles in the area of mobile application functional testing automation on the basis of the offered support to testing automation and of the addressed testing levels.

  • G2   To study in detail the characteristics of the proposed techniques and tools, such as the enabling inputs, their support to the generation of test cases, the generated test outputs and the supported mobile frameworks.

  • G3   To study how the proposed techniques and tools have been evaluated in terms of the characteristics of the experiments that have possibly been carried out, the involved applications under test, and the comparisons with other techniques and tools.

  • G4   To identify the most active researchers in this area and their affiliations, the most attractive venues and journals for papers in this field, and the most influential papers.

3.1.2 Research questions

For each of the four proposed goals, a set of specific questions has been formulated. With respect to the first goal (G1), the following questions have been posed:

  1. RQ 1.1

    What testing activities are automated?

  2. RQ 1.2

    What testing levels are addressed?

The first question is aimed at classifying the selected papers with respect to the degree of automation they provide to the testing process. In particular, we have evaluated whether the techniques and tools proposed in the considered articles support the automation of test case design and implementation, test case execution, and oracle definition and evaluation. The second question aims at the classification of the papers according to the addressed testing level (i.e., unit testing, integration testing, or system testing).

The second goal (G2) has been addressed by the following seven research questions:

  1. RQ 2.1

    What inputs are used by the proposed testing techniques to derive test artifacts?

  2. RQ 2.2

    What kinds of techniques are proposed for test case generation?

  3. RQ 2.3

    What kinds of test oracles are considered?

  4. RQ 2.4

    What kinds of test artifacts are generated?

  5. RQ 2.5

    What are the characteristics of the proposed testing tools?

  6. RQ 2.6

    Which mobile frameworks are the targets of the proposed techniques and tools?

  7. RQ 2.7

    Are the proposed techniques and tools usable on emulators or real devices?

This set of questions aims at the detailed analysis of the main characteristics of the proposed techniques and tools. In particular, RQ 2.1 focuses on the inputs needed to enable the proposed techniques and tools, such as source code, executable code, high-level models, existing test cases, or user sessions, while RQ 2.2 focuses on the proposed test case generation techniques, if any. The questions RQ 2.3 and RQ 2.4 respectively examine the considered test oracles, if any, and the outputs generated by the application of the proposed testing techniques. RQ 2.5 has been posed to collect information about the characteristics of the proposed testing tools (if any), such as their names, their dependencies on other external tools or resources, their availability (e.g., open source, freely downloadable, commercial, or not available), the type of technique adopted for test case generation (i.e., static analysis, dynamic analysis, or hybrid, combining static and dynamic analyses), the languages used for the tool implementation, the targeted execution framework, and the date of their last update. The questions RQ 2.6 and RQ 2.7 are directed at the technological characteristics of the proposed techniques and tools, since they express the technological scope (i.e., Android, iOS, Windows Phone, or others) and the possibility of applying them to real devices, emulated devices, or both.

The third goal (G3) has been pursued by analyzing in detail the evaluation experiments carried out to validate the proposed techniques and tools and to compare their results with the ones obtained by using other similar tools. In particular, the following three questions have been posed:

  1. RQ 3.1

    What are the characteristics of the performed evaluation studies?

  2. RQ 3.2

    What are the characteristics of the sets of applications objects of the evaluation experiments?

  3. RQ 3.3

    What are the characteristics of the performed comparative studies?

Question RQ 3.1 is focused on the evaluation studies performed to validate the proposed techniques and tools. We have distinguished between papers providing experimental studies aimed at evaluating the performance of the proposed techniques or tools and papers providing only a demonstration of the feasibility of the proposed technique or tool. Question RQ 3.2 investigates the type and the quantity of the applications considered in the evaluation experiments reported in the selected papers. Finally, question RQ 3.3 regards the characteristics of the comparative studies that may have been carried out to compare the performance of the proposed techniques and tools with that of other tools considered as benchmarks.

Finally, the goal G4 has been investigated by posing some questions related to the demographics and bibliometrics of the selected articles and authors. The following questions have been formulated:

  1. RQ 4.1

    What is the number of published articles per year?

  2. RQ 4.2

Which venues have the highest article counts?

  3. RQ 4.3

Which are the most influential articles in terms of citation counts?

  4. RQ 4.4

Who are the authors with the highest number of articles?

  5. RQ 4.5

Which countries have produced the most articles?

  6. RQ 4.6

What are the authors' affiliations?

The first question aims at investigating the trend of interest of the scientific community in the studied topic over the years, while the second question identifies the venues and journals that most frequently host contributions in this field. The third question aims at evaluating which papers have had the most influence on the literature, by taking into account the number of works citing them. The fourth question has been posed to identify the most prolific authors. Finally, the last two questions classify the papers in terms of the country in which the authors work and of their affiliation (academia or industry).

3.1.3 Metrics

In order to support the evaluation of the research questions, for each of them an attribute or a list of sub-attributes has been formulated. For each attribute and sub-attribute, a set of possible values has been defined.

These lists of attributes and possible values were only sketched in the first step of the process, leaving the possibility of refining them during the subsequent steps. In particular, these lists have been modified during the Data Extraction and Mapping step and finalized at the end of that step. Table 1 reports, for each research question, the set of attributes and sub-attributes that have to be evaluated for each selected paper and the list of possible values for each of these attributes. For the sake of brevity, we have reported only the final version of this list. In Table 1, the sub-attributes are written in italics.

Table 1 Final classification scheme reporting attributes and sub-attributes that have been evaluated on each selected paper and possible values for the attributes

The attributes have been designed in order to admit zero or more possible values for each paper. In fact, some attributes are not applicable to each paper (e.g., the tool characteristics are applicable only on papers presenting a tool).

In order to provide answers to the proposed research questions, for most of the considered attributes, we have automatically evaluated a metric consisting of counting the occurrences of each value assigned to each attribute.

3.2 Search strategy definition

The search strategy adopted in this paper is inspired by the one proposed by Kitchenham and Charters (2007) and recently adopted by Zein et al. (2016).

In detail, (1) we have selected a set of sources of evidence (i.e., online available search engines), (2) we have selected a set of keywords able to drive the search, (3) we have formulated a set of tentative search strings including the selected keywords, (4) we have executed the proposed queries on the considered search engines, and (5) we have validated the results of these queries. In order to perform this validation, we have built in advance a list of papers that we consider relevant for the proposed topic and we have evaluated the ability of the formulated queries to retrieve the papers belonging to this list. The last three operations have been repeated until the tentative queries were able to retrieve all the papers in this list.

The set of sources of evidence has been built by considering the online databases that index Computer Science literature and that have often been considered in other systematic mapping studies, too. In particular, the following search engines have been considered: Scopus (Footnote 4), IEEExplore (Footnote 5), ACM (Footnote 6), SpringerLink (Footnote 7), ISI Web of Knowledge (Footnote 8), ScienceDirect (Footnote 9), and Google Scholar (Footnote 10).

In general, it is expected that most of the papers retrievable using IEEExplore, ACM, SpringerLink, ISI, and ScienceDirect are also retrievable using Scopus, because Scopus has been designed to be an aggregator of all the contributions available from the most relevant publishers. Google Scholar, instead, has a larger scope, since it has been designed to index any content available on the web. For this reason, a larger number of contributions is retrievable via Google Scholar, but the relevance of these contributions has to be carefully evaluated.

After the selection of the sources of evidence, a set of tentative search strings has been built by selecting popular keywords used in this field. We have selected these keywords both on the basis of our specific knowledge of the research field and on the keywords used by similar systematic mapping studies in literature (i.e., Mendez-Porras et al. 2015b; Zein et al. 2016; Sahinoglu et al. 2015; Holl and Elberzhager 2016). Synonyms and other alternative keywords have been considered, too. The boolean operator OR has been used to consider different synonyms while the boolean operator AND has been used to link the different keywords. Since each search engine supports a different syntax for queries and provides different options to filter the search, different search strings have been formulated taking into account the peculiarities of the search engines.

Four main keywords have been considered: “mobile applications” (with popular hyponyms such as “Android applications,” “apps,” “iOS applications,” “Symbian applications,” and “Windows Phone applications”), “testing,” “technique” (for which different synonyms and hyponyms have been considered, such as “approach,” “method,” “tool,” and “framework”), and “automation” (and its synonyms “automated” and “automatic”).
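Purely as an illustration of how these keywords and operators are combined (the exact strings executed on each engine are those reported in Table 2), a query of the general shape used in this study looks as follows:

("mobile application" OR "Android application" OR "apps" OR "iOS application"
    OR "Symbian application" OR "Windows Phone application")
AND ("testing")
AND ("technique" OR "approach" OR "method" OR "tool" OR "framework")
AND ("automation" OR "automated" OR "automatic")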

The proposed search strings are conceptually equivalent to each other. For all the engines, the searches have been restricted to the title, abstract, and keywords of each indexed paper. In addition, the search on Scopus has also taken into account the titles of the papers referenced in the bibliography section of each paper. On Scopus and ScienceDirect, the search has been restricted to the Computer Science field to reduce the set of results. The possibility of extending the search to the entire text of the papers is available only in some search engines, such as IEEExplore and ACM, but it has been discarded in order to limit the size of the result set (when the search was extended to the full text of the papers, the same search strings returned 8,931 results for IEEExplore and 397,267 results for ACM). For Google Scholar, instead, the search always covers the full text of the papers. Google Scholar estimated that the number of returned results was higher than 56,000, but we have limited the analysis to the first 1,000 results, which are the most relevant ones according to the Google Scholar ranking algorithm.

In order to validate the proposed search strings, we have selected a set of papers that have been judged relevant for this systematic mapping study. This set includes 55 papers cited by recent secondary studies in the literature (Zein et al. 2016; Choudhary et al. 2015), in the related work sections of some recent and influential papers (Moran K et al. 2016; Mao et al. 2016), or in the Ph.D. thesis of one of the authors (Amatucci 2016). The other authors have read these papers in advance and have confirmed that they should be included in this study for their relevance.

Table 2 shows the final set of search strings that have been formulated for the seven considered search engines. In particular, the third column of this table reports the number of papers from this set of 55 relevant papers that have also been retrieved by each of the considered search engines, while the total number of papers retrieved by each query is shown in the last column. The last row of the table shows that all the 55 relevant papers have been found by at least one search engine and that the total number of retrieved papers (including duplicates) is 4509.

Table 2 The search strings that have been executed on the different search engines and the obtained results (the queries have all been executed on September 1, 2017)

The queries to the search engines have been executed on September 1, 2017, and include only the papers indexed at that date. The results reported in Table 2 show that the Scopus search engine is able to find 54 out of the 55 relevant papers (the paper of Mirzaei et al. (2012) is only retrievable on ACM and Google Scholar), while Google Scholar is able to find 46 out of 55 relevant papers. The remaining search engines have retrieved smaller subsets of relevant papers because they index only papers from a subset of publishers. Since the selected search strings have demonstrated their ability to retrieve each of the 55 selected papers in at least one of the considered search engines, they have been considered valid for this systematic mapping study.

The search strings reported in Table 2 have been selected after several trials characterized by worse performance in terms of retrieval of the set of 55 relevant papers. For example, the same search on Scopus limited only to the title, abstract, and keywords fields returned 209 results, including only 26 out of the 55 expected papers.

Differently from our search strings, the more general search strings proposed by Zein et al. (2016) have been able to retrieve only 49 of the 55 relevant papers on Scopus, while returning a total of more than 10,000 papers. The search strings proposed by the similar work of Mendez-Porras et al. (2015b) are able to find only 50 out of 55 relevant papers on Scopus and are not executable on some search engines, such as IEEExplore, due to the current limitation of 15 keywords per query.

3.3 Study selection criteria definition

The purpose of the inclusion and exclusion criteria is to limit the study selection to papers that fit the proposed topic, i.e., techniques and tools for the automation of functional testing of mobile applications, and that are available in the scientific literature. To this aim, a set of inclusion criteria useful to identify the studies that could be considered in this mapping has been designed. In addition, a set of exclusion criteria has been formulated in order to exclude studies related to other fields of interest or not specifically focused on the selected topic.

In detail, the following list of inclusion criteria has been considered:

  1. 1.

    Studies must be directly related to automated software testing techniques for native mobile applications.

  2. 2.

Studies must be focused on functional testing of mobile applications, including system testing, unit testing, integration testing, or any other testing activity aimed at the verification of the functional correctness of the application.

  3. 3.

    Studies must provide a qualitative or a quantitative evaluation of the proposed contributions.

In order to filter out off-topic papers, the following list of exclusion criteria has been defined:

  1. 1.

    Studies focused on testing embedded systems in general, and not directly related to mobile devices.

  2. 2.

    Studies focused on mobile communication infrastructure, mobile hardware, or robotics.

  3. 3.

    Studies focused on testing mobile applications different from native applications, such as mobile web applications.

  4. 4.

    Studies focused on other testing or quality assurance techniques, such as security testing, performance testing, energy consumption evaluation, usability testing, and compatibility testing.

  5. 5.

    Studies focused on static analysis without support to testing automation.

  6. 6.

    Studies related to other software development phases such as analysis, design, or implementation and not focused on testing.

  7. 7.

    Studies that merely present opinions or ideas without any proposed testing technique and any implemented testing tool.

  8. 8.

    Studies written in languages other than English or not available on the Internet in full-text form.

  9. 9.

    Studies that did not appear in the published proceedings of a peer reviewed conference, symposium, or workshop, or did not appear in a journal or magazine (i.e., thesis, technical reports, patents, blogs, or personal web pages).

  10. 10.

    Studies that are duplicates of other studies.

  11. 11.

    Surveys, reviews, mapping studies, and any other secondary study.

3.4 Screening of papers

During the execution of the Screening of Papers step of the process, the set of papers retrieved by the search engines has been progressively reduced by filtering duplicated and irrelevant papers.

A first filtering of the 4509 returned results consisted of the automatic elimination of duplicates, i.e., the elimination of all but the first entry of each paper retrieved by more than one search engine.

After this task, 3810 papers remained, as shown in the third column of Table 3. The majority of these papers (2409) are indexed by Scopus, which is the first considered search engine. For example, the 86 papers in the IEEExplore row are articles that have been retrieved by IEEExplore but not by Scopus.

Table 3 Number of selected papers after the different steps of the systematic mapping process

The filtering of the remaining papers has been executed by means of the Keywording of Abstracts task, based on the inclusion and exclusion criteria presented in the previous subsection. In detail, the authors have analyzed the title, abstract, and keywords (where available) of the 3810 considered papers and have filtered out all the articles that do not satisfy all the inclusion criteria or that satisfy one or more exclusion criteria. Borderline papers, for which this analysis was not sufficient to decide on inclusion or exclusion, have not been excluded in this step. After the execution of this screening activity, 351 papers remained. The fourth column of Table 3 reports the number of remaining papers, grouped by search engine.

The papers retrieved by IEEExplore, ACM, ISI, and ScienceDirect are generally very recent papers (not yet indexed by Scopus) or papers published in venues not covered by Scopus. The 105 papers retrieved only by Google Scholar correspond to papers not indexed by the other search engines or to papers retrievable only with a full-text search.

3.5 Data extraction and mapping of studies

The Data Extraction and Mapping step has been performed by the authors by completely reading the full text of the 351 selected papers.

The papers to be analyzed have been divided among the authors, and each author evaluated whether the paper should be included in or excluded from the systematic mapping on the basis of the study selection criteria. In addition, for all the papers resulting from this filtering, the authors have assigned values to each of the attributes in Table 1, when applicable. The authors have held some joint meetings in order to discuss the inclusion or exclusion of the borderline papers and to review the values assigned to the attributes. Only the demographic and bibliometric data requested by the fourth goal of the study have been automatically extracted from the data exported from the search engines, without any need for reviews or joint discussions.

At the end of this step, a set of 131 papers has been selected. The last column of Table 3 reports the final number of selected papers for each considered search engine.

3.6 Data availability

The Systematic Mapping reporting the complete list of the selected papers and all the values assigned to each attribute of each paper is not reported here for reasons of space, but it is available online at https://goo.gl/678T5P.

4 Analysis of results

The data extracted and collected during the previous steps have been aggregated in order to provide answers to the proposed research questions. In the following, the results obtained from the study and the answers that can be given to each research question will be presented and discussed.

4.1 RQ 1.1 What testing activities are automated?

Only 23 out of 131 papers provide fully automated testing processes, including automatic test case generation and execution and automatic generation and evaluation of test oracles (Adamsen et al. 2015; Amalfitano et al. 2012a, 2015d; Costa et al. 2014; Hao et al. 2014; Hu et al. 2016; Imparato 2015; Liang et al. 2014; Liu et al. 2016; Maji et al. 2012; Mao et al. 2016; Mendez-Porras et al. 2015a; Mirzaei and Heydarnoori 2015; Moran K et al. 2015; Packevicius et al. 2015; Takala et al. 2011; White et al. 2015; Liu et al. 2014a; Zaeem et al. 2014; Zhu et al. 2015; Li et al. 2016b, 2017; Fazzini et al. 2016b).

Of the remaining papers, 63 present techniques and tools that are able to automatically generate and execute test cases but that do not provide any support to the automatic definition or evaluation of oracles.

On the other hand, we have found 32 papers that support the automatic test case execution and the oracle definition and evaluation (in most cases, the occurrence of crashes and exceptions has been evaluated).

In addition, we have found 13 papers that provide automation only for a single activity of the testing process. Five of them provide support only to the automatic generation of test cases that are not directly executable. In particular, in Puspika et al. (2015), Zheng et al. (2017), and Shabaan et al. (2017), model-based techniques are proposed, while Yu and Takada (2016) describe an approach based on the generation of external events and Liu et al. (2017) propose an approach exploiting machine learning techniques.

Seven other papers are focused on automatic test execution. For example, the paper of Griebe et al. (2016) addresses the problem of executing test cases on speech-based applications. Other studies are related to automatic test execution in the context of Windows mobile applications (Mayan et al. 2015), Symbian applications (She et al. 2009; Jiang et al. 2007), or cloud infrastructures (Prathibhan et al. 2014). Finally, the works of Sadeh (2011) and Liu et al. (2014b) propose techniques for the automatic execution of unit test cases in the context of Android applications.

Finally, there is only one paper completely devoted to oracle evaluation. In the approach presented by Hsiao et al. (2014), the Android framework has been instrumented so that logs of the executions of the Android application under test are generated while it is exercised by real users. These logs are automatically analyzed in order to detect concurrency races.

Table 4 reports the complete list of references of the 131 papers selected by the study, classified on the basis of the support offered to the testing activities.

Table 4 List of selected papers, grouped by their support to test automation

Figure 2 shows the distribution of papers according to the support offered to the testing activities.

Fig. 2 Classification of papers according to the support to testing automation

4.2 RQ 1.2 What testing levels are addressed?

Most of the selected papers (122 out of 131) present system testing approaches, in which the whole application is tested in its execution environment (real devices or emulators). In particular, there is a large number of GUI-based approaches, in which test cases are defined as sequences of events or interactions acting on the GUI of the application under test.
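As a concrete illustration of this kind of GUI-level test case, the following is a minimal sketch written with the Espresso library (one of the supporting libraries discussed in Section 4.7): the test encodes a sequence of events on the GUI of the application under test followed by a check on the resulting screen. The activity class, view ids, and strings are hypothetical, and the support-library package names of that period are assumed.

// A minimal sketch of a GUI-level system test written with Espresso;
// LoginActivity, the view ids, and the expected text are hypothetical.
import android.support.test.rule.ActivityTestRule;
import android.support.test.runner.AndroidJUnit4;
import org.junit.Rule;
import org.junit.Test;
import org.junit.runner.RunWith;

import static android.support.test.espresso.Espresso.onView;
import static android.support.test.espresso.action.ViewActions.click;
import static android.support.test.espresso.action.ViewActions.closeSoftKeyboard;
import static android.support.test.espresso.action.ViewActions.typeText;
import static android.support.test.espresso.assertion.ViewAssertions.matches;
import static android.support.test.espresso.matcher.ViewMatchers.isDisplayed;
import static android.support.test.espresso.matcher.ViewMatchers.withId;
import static android.support.test.espresso.matcher.ViewMatchers.withText;

@RunWith(AndroidJUnit4.class)
public class LoginGuiTest {

    @Rule
    public ActivityTestRule<LoginActivity> activityRule =
            new ActivityTestRule<>(LoginActivity.class);

    // The test case is a sequence of GUI events (type, click) followed by a check.
    @Test
    public void loginShowsWelcomeMessage() {
        onView(withId(R.id.username)).perform(typeText("alice"), closeSoftKeyboard());
        onView(withId(R.id.password)).perform(typeText("secret"), closeSoftKeyboard());
        onView(withId(R.id.login_button)).perform(click());
        onView(withText("Welcome")).check(matches(isDisplayed()));
    }
}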

Only two approaches regard integration testing. The works of Maji et al. (2012) and Jha et al. (2015) are focused on the testing of the interactions between the main components of the application under test (e.g., activities, services, broadcast receivers) by means of specific intent calls.

Seven papers propose techniques and tools supporting unit testing (Delamaro et al. 2006; Sadeh 2011; Mirzaei et al. 2012; van der Merwe et al. 2012; Liu et al. 2014b, 2014c; De Cleva Farto and Endo 2015). In the works of Mirzaei et al. (2012), van der Merwe et al. (2012), and Liu et al. (2014c), specific components of the Android applications are tested in isolation from the target execution environment. In fact, in these three approaches, the application components are tested on the traditional Java Virtual Machine (JVM) instead of on the specific Dalvik Virtual Machine that was installed on almost all Android devices at the time these papers were written. The other four papers propose techniques and tools directly supporting unit testing of classes and methods included in Android applications (Delamaro et al. 2006; Sadeh 2011; Liu et al. 2014b; De Cleva Farto and Endo 2015).

A possible reason why so few approaches to unit and integration testing are available in the literature may be that many of the unit and integration testing techniques designed for desktop applications can be reused on mobile applications, too. For example, the JUnit framework can be used to test any component of an Android application that does not depend on any specific functionality of the targeted mobile device. Thus, there is no need for mobile-specific unit testing techniques in such cases.
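For instance, a component of a mobile application that does not depend on device-specific APIs can be unit tested with plain JUnit on the development machine, exactly as for a desktop application; a minimal sketch follows, in which the PriceFormatter class is hypothetical.

// A minimal sketch of a plain JUnit 4 unit test for a device-independent component
// of a mobile application; the PriceFormatter class is hypothetical.
import org.junit.Test;
import static org.junit.Assert.assertEquals;

public class PriceFormatterTest {

    @Test
    public void formatsCentsAsEuros() {
        PriceFormatter formatter = new PriceFormatter();
        // No Android API is involved, so the test runs on the local JVM.
        assertEquals("12.50 EUR", formatter.format(1250));
    }
}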

4.3 RQ 2.1 What inputs are used by the proposed testing techniques to derive test artifacts?

Most of the contributions found in the literature exploit one or more of the following five sources of information: the source code of the application under test, its executable code (usually bytecode), high-level models, existing test cases, and user sessions.

The number of white-box approaches based on the analysis of the source code of the application under test (53 out of 131) is substantially equivalent to the number of black-box approaches needing only the executable code (57 out of 131).

Black-box approaches have often been evaluated on very large sets of free applications available on public markets such as Google Play for Android, while white-box approaches have often been evaluated on sets of applications found in public repositories of open-source mobile applications (for example, http://f-droid.org for Android applications).

Forty-seven approaches are based on the analysis of high-level models of the application under test. In particular, 20 papers are based on manually designed models of the application under test, including, for example, finite state machines (Nguyen et al. 2012; Majeed and Ryu 2016; Su 2016), sequence diagrams (Anbunathan and Basu 2016b), and activity diagrams (Griebe et al. 2015; Li et al. 2014a). Twenty-seven other approaches are instead based on models that are automatically generated by reverse engineering processes, such as GUI trees (Wang et al. 2014; Wen et al. 2015).

The 23 approaches based on existing user sessions include the 12 proposing capture and replay techniques able to collect and re-execute user sessions. The other ones are also generally able to transform existing user sessions into executable test cases. Usually, basic user events are considered by these approaches, but some papers specialize in the identification and generation of the complex gesture events typical of mobile devices, such as the one of Hesenius et al. (2014).

Another 23 contributions are based on the transformation of existing test cases, which are often in the form of executable JUnit test cases for Android applications. Finally, there is only one preliminary contribution based on information obtained from bug repositories (Mendez-Porras et al. 2015a), in which only the requirements of a test generation system based on the analysis of a bug repository have been proposed.

The histogram in Fig. 3 reports the distribution of the selected papers with respect to the types of needed input sources.

Fig. 3 Types of input sources used by the techniques and tools proposed in the selected papers

4.4 RQ 2.2 What kinds of techniques are proposed for test case generation?

The automatic generation of test cases is a fundamental feature for most of the approaches found in literature. Only 14 out of 131 contributions are based on manually written test cases: in these cases, the automation is limited to test case execution or evaluation.

In the approaches presented in 69 of the 131 considered papers, the test case generation is obtained with model-based techniques. These techniques are based on high-level models (i.e., behavioral models such as sequence diagrams, activity diagrams, GUI trees, event flow graphs, finite state machines) or low-level models (i.e., models directly related to the code of the applications under test, such as control flow graphs or call graphs). In addition, in 29 papers, models are automatically generated during the testing process itself (they are usually called active learning techniques; Hao et al. 2015).
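To make the idea of model-based generation more concrete, the following is a minimal sketch, under simplifying assumptions, of how event sequences (i.e., abstract test cases) can be derived from a GUI model represented as a finite state machine: a breadth-first visit of the model collects one event sequence for each reachable state. The plain map-based model representation is an assumption of this sketch and does not correspond to the format used by any specific tool surveyed here.

// A minimal sketch of model-based test sequence generation via a breadth-first
// visit of a GUI model (states connected by events).
import java.util.*;

public class FsmTestGenerator {

    // transitions.get(state).get(event) = target state reached by firing the event
    public static List<List<String>> generate(String initialState,
            Map<String, Map<String, String>> transitions) {
        List<List<String>> sequences = new ArrayList<>();
        Map<String, List<String>> pathTo = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        pathTo.put(initialState, new ArrayList<>());
        queue.add(initialState);
        while (!queue.isEmpty()) {
            String state = queue.poll();
            sequences.add(pathTo.get(state));   // the event sequence reaching this state
            for (Map.Entry<String, String> t :
                    transitions.getOrDefault(state, Collections.emptyMap()).entrySet()) {
                if (!pathTo.containsKey(t.getValue())) {
                    List<String> path = new ArrayList<>(pathTo.get(state));
                    path.add(t.getKey());       // extend the path with the fired event
                    pathTo.put(t.getValue(), path);
                    queue.add(t.getValue());
                }
            }
        }
        return sequences;
    }

    public static void main(String[] args) {
        Map<String, Map<String, String>> model = new HashMap<>();
        model.put("Main", Map.of("clickSettings", "Settings", "clickAbout", "About"));
        model.put("Settings", Map.of("pressBack", "Main"));
        generate("Main", model).forEach(System.out::println);
    }
}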

In 21 other contributions, test cases have been generated by transforming existing user sessions into executable test cases, as shown by the answer to the previous research question.

In only 2 articles, test cases have been obtained by mutating existing test cases (Adamsen et al. 2015; Amalfitano et al. 2013a). Both Adamsen et al. (2015) and Amalfitano et al. (2013a) have injected specific sequences of events into existing test cases, reproducing, for example, the closing and restart of the application or the loss of the Internet connection, in order to test the robustness of the application under test with respect to these events.

In 22 cases, random techniques support mobile application testing. Both uniform random techniques (for example in Choi et al. (2013), Machiry et al. (2013), Hu et al. (2014), and Amalfitano et al. (2015c)) and smarter random techniques (for example in Liu et al. (2010a), Hu and Neamtiu (2011a), Machiry et al. (2013), and Wen et al. (2015)) have been considered for test case generation.
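A minimal sketch of the uniform random strategy underlying several of these techniques follows: at each step, one of the currently enabled GUI actions is chosen with equal probability and fired, and the resulting trace can later be replayed as a test case. The Driver interface is a hypothetical placeholder for a concrete event-injection backend and is not part of any specific tool.

// A minimal sketch of uniform random test input generation with a crash-based
// implicit oracle; the Driver interface is a hypothetical placeholder.
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class RandomTester {

    public interface Driver {
        List<String> enabledActions();   // e.g., "click(loginButton)", "scrollDown"
        void fire(String action);        // injects the event into the app under test
        boolean crashed();               // implicit crash-based oracle
    }

    public static List<String> run(Driver driver, int maxEvents, long seed) {
        Random random = new Random(seed);          // seeded for reproducibility
        List<String> trace = new ArrayList<>();
        for (int i = 0; i < maxEvents && !driver.crashed(); i++) {
            List<String> actions = driver.enabledActions();
            if (actions.isEmpty()) {
                break;
            }
            String action = actions.get(random.nextInt(actions.size()));
            driver.fire(action);
            trace.add(action);                     // the trace can be replayed as a test case
        }
        return trace;
    }
}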

A recent trend is the proposal of search-based testing techniques, usually based on genetic algorithms: in this study, we have found five such contributions since 2014 (Mahmood et al. 2014; Zhu et al. 2015; Amalfitano et al. 2015a; Mao et al. 2016; Su 2016).

A summary of the techniques used for test case generation is shown in the histogram reported in Fig. 4.

Fig. 4 Types of techniques for test case generation proposed in the selected papers

4.5 RQ 2.3 What kinds of test oracles are considered?

Oracle definition is generally considered the most difficult phase of the testing process to automate (Barr et al. 2015). In 66 of the considered papers, no approach at all for automatic oracle definition and evaluation has been proposed. In addition, in 48 of the remaining approaches, the detection of crashes or exceptions represents the only, implicit way to evaluate the result of the executed test cases.

More specific test oracles have been proposed in a few papers. In eight papers, models of the behavior of the application are available from the application design or via reverse engineering, and an abstraction of the state of the GUI of the application under test has been proposed. In these cases, it is possible to define the expected GUI state at the end of the execution of each test case. The result of the test case is given by the comparison between the expected GUI state and the GUI state actually obtained (Costa et al. 2014; Hao et al. 2014; Hu et al. 2014; Hu et al. 2015; Salva and Laurencot 2015; Joorabchi et al. 2016; Baek and Bae 2016; Hu et al. 2016). In particular, this approach has been used for the evaluation of the fidelity of the replayed traces in capture and replay techniques (Hu et al. 2014, 2015).
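A minimal sketch of such a GUI-state oracle follows, under the simplifying assumption that a GUI state is abstracted as a set of widget descriptors: the verdict is obtained by comparing the expected abstraction with the one observed at the end of the test case. The descriptor format is an assumption of this sketch, not the abstraction used by any of the cited tools.

// A minimal sketch of a GUI-state oracle: the GUI state is abstracted as a set of
// widget descriptors (e.g., "Button:OK:enabled"), and the verdict is the comparison
// between the expected and the observed abstraction.
import java.util.Set;

public class GuiStateOracle {

    public enum Verdict { PASS, FAIL }

    public static Verdict evaluate(Set<String> expectedWidgets, Set<String> observedWidgets) {
        return expectedWidgets.equals(observedWidgets) ? Verdict.PASS : Verdict.FAIL;
    }

    public static void main(String[] args) {
        Set<String> expected = Set.of("Button:OK:enabled", "TextView:Welcome:visible");
        Set<String> observed = Set.of("Button:OK:enabled", "TextView:Error:visible");
        System.out.println(evaluate(expected, observed)); // FAIL: the GUI differs from the model
    }
}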

In five contributions, the result of the test is obtained by automatically comparing the screenshot of the current GUI with an expected one that has usually been obtained from previous executions of the same application (Liu et al. 2010b; Lin et al. 2014; Mendez-Porras et al. 2015a; Packevicius et al. 2015; Tang et al. 2016), whereas seven other articles propose the evaluation of invariants, such as race conditions (Hsiao et al. 2014; Maiya et al. 2014; Bielik et al. 2015) or other specific invariants (Hao et al. 2014; Shan et al. 2016; Li et al. 2017). In particular, in Zaeem et al. (2014), invariants derived from the analysis of common bugs of the applications have been considered as oracles.

Finally, in five papers, manually written assertions have been used to evaluate the proposed testing tool (Fazzini et al. 2017; Liu et al. 2014a; She et al. 2009; Jiang et al. 2016; Wu et al. 2016).

Figure 5 shows the distribution of papers with respect to the types of oracles.

Fig. 5 Types of oracles considered in the selected papers

4.6 RQ 2.4 What kinds of test artifacts are generated?

As regards the explicit outputs of the proposed testing techniques and tools, 75 papers propose techniques that generate executable test cases. In many cases, the tests can be executed only by means of the same tool that generated them. In some cases, the proposed tools are able to export the generated test cases so that they can be executed outside the context of the test generation process (for example, in the form of JUnit test cases). For example, Android Ripper (Amalfitano et al. 2012b) is able to generate executable JUnit test cases, RERAN (Gomez et al. 2013) is able to reproduce the same user sessions that have been captured, and Sapienz (Mao et al. 2016) is able to generate sequences of events from which test cases can be derived.
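As an illustration of this kind of export, the following sketch turns an abstract event sequence into the source code of a JUnit test case; the emitted statements call a hypothetical driver API and do not reproduce the export format of any specific tool.

// A minimal sketch of exporting a captured or generated event sequence as JUnit
// source code; the emitted driver.fire(...) calls are hypothetical.
import java.util.List;

public class JUnitExporter {

    public static String export(String testName, List<String> eventSequence) {
        StringBuilder src = new StringBuilder();
        src.append("import org.junit.Test;\n\n");
        src.append("public class ").append(testName).append(" {\n");
        src.append("    @Test\n    public void generatedScenario() throws Exception {\n");
        for (String event : eventSequence) {
            // each abstract event becomes one executable statement
            src.append("        driver.fire(\"").append(event).append("\");\n");
        }
        src.append("    }\n}\n");
        return src.toString();
    }

    public static void main(String[] args) {
        System.out.println(export("GeneratedTest",
                List.of("click(loginButton)", "typeText(username, alice)")));
    }
}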

In most of the approaches (112 out of 131 papers), a test execution report is produced as output, providing information about crashes, code coverage, or just an execution log. In addition, as shown by RQ 2.2, in 27 papers, models are automatically built during the testing activities and can be considered as an additional output that can be useful to comprehend the behavior of the tested application. Finally, in 15 cases, the proposed techniques are able to produce input values that can be used to automatically generate test cases.

4.7 RQ 2.5 What are the characteristics of the proposed testing tools?

Most of the selected contributions are not strictly theoretical or methodological, but they include the presentation of tools implementing the proposed testing techniques. In fact, 107 different tools have been presented in the selected papers, but only some of these tools are freely available. The source code of 22 of these tools is available, usually in the context of GitHub projects, whereas six of these tools are available only in executable form or as demo versions (i.e., Agrippin (Amalfitano et al. 2015a), the executable-only versions of Android Ripper used in Amalfitano et al. 2012a, 2012b, EventRacer (Bielik et al. 2015), TrimDroid (Mirzaei et al. 2016b), and BARISTA (Fazzini et al. 2017)). In addition, three tools are commercially available (i.e., Testdroid (Kaasila et al. 2012), Caiipa (Liang et al. 2014), and MZoltar (Machado et al. 2013), for which a limited demo version is also available): they were born as academic prototypes and have evolved into commercial tools.

It is interesting to see that all the open-source tools presented in the selected papers have been developed for the Android platform, whereas Caiipa is the only example of a commercial tool presented in the selected papers that has been developed for the Windows Phone framework.

The test generation techniques adopted by the 22 open-source tools have been classified by distinguishing between static analysis techniques (where tests are generated on the basis of information such as the source code or high-level models of the application under test), dynamic analysis techniques (where tests are generated and executed on-the-fly by analyzing the application during its execution), and hybrid techniques (that combine static and dynamic analyses). We found six tools based on static analysis techniques: ICCMATT (Jha et al. 2015) and DroidFuzzer (Ye et al. 2013), which analyze the source code of the application under test, including the Android manifest; BBoxTester (Zhauniarovich et al. 2015) and CRAXDroid (Yeh et al. 2014), which analyze the bytecode of the application; Magi[c] (Nguyen et al. 2012), which analyzes a statically designed FSM model of the application; and THOR (Adamsen et al. 2015), which mutates the source code of existing test cases.

Thirteen other tools are based on dynamic analysis techniques. Most of them exploit systematic model learning and/or random techniques for the automatic exploration of the GUI of the application under test (e.g., Android Ripper (Amalfitano et al. 2012b), A3E (Azim and Neamtiu 2013), SwiftHand (Choi et al. 2013), Dynodroid (Machiry et al. 2013), SlumDroid (Imparato 2015), DroidMate (Jamrozik and Zeller 2016), MCrawlT (Salva and Laurencot 2015), PUMA (Hao et al. 2014), SmartMonkey (Sun et al. 2016), DroidBot (Li et al. 2017), and DroidRacer (Maiya et al. 2014)). Finally, RERAN (Gomez et al. 2013) and VALERA (Hu et al. 2015) are capture and replay tools able to automatically generate and execute test cases corresponding to the observed executions of the application under test.

The remaining three tools combine static and dynamic analysis techniques. Sapienz (Mao et al. 2016) combines information obtained by static analysis with dynamic exploration techniques to implement a search-based exploration strategy. KREFinder and KREReproducer (Shan et al. 2016) exploit, respectively, a static analysis technique to find the resume and restart event handlers and a dynamic exploration technique to test the behavior of the application under test with respect to the execution of these events. Finally, JPF-Android (van der Merwe et al. 2012) exploits static analysis to model an Android application in order to dynamically test it with a Java PathFinder extension in the context of a Java Virtual Machine.

Not all these tools are currently maintained: 14 out of 22 have not been updated since 2016. The strict dependence of the tools on the rapidly evolving Android environment makes the maintenance of these projects very hard and time-consuming. Choudhary et al. (2015) have experimentally compared the performance of several of these tools by executing them in the same execution environment, and have reported difficulties in adapting the tools to a common target execution environment. Most of the tools have been implemented partially or totally in Java in order to interact with the source code of the application under test and/or with the Android JUnit test environment. Some tools (in particular, the ones interacting at a low level with the Android framework) have been implemented partially or totally in other languages such as Python (Ye et al. 2013; Zhauniarovich et al. 2015; Li et al. 2017; Maiya et al. 2014; Mao et al. 2016; Sun et al. 2016), Scala (Choi et al. 2013), Ruby (Azim and Neamtiu 2013), Kotlin (Jamrozik and Zeller 2016), C (Gomez et al. 2013), C++ (Hu et al. 2015), and Javascript (Adamsen et al. 2015).

Table 5 reports a summary of the characteristics of the 22 open-source tools, the type of the adopted test case generation technique, the URLs at which they are currently available, and the date of their last update.

Table 5 Characteristics of open-source testing tools

These tools are usually based on existing libraries and other tools. The most commonly used resource is the RobotiumFootnote 11 library supporting the writing of JUnit test cases, which is used by 21 tools. Other similar libraries are UIAutomatorFootnote 12 (used in 11 contributions), HierarchyViewerFootnote 13 (used in two contributions), and EspressoFootnote 14 (used only in Tang et al. 2016). Espresso is the library recommended by Google for the development of test cases, but it is not yet considered by other academic studies, probably because of its recent diffusion (it has been available and integrated with Android Studio only since 2014). The Emma library, which is available in all the Android framework versions, has been used in 14 different contributions to measure code coverage. Other frequently used supporting tools provided by the Android framework are MonkeyFootnote 15 (used in nine contributions) and chimpchat, which is included in MonkeyRunnerFootnote 16 (used in five contributions); both are able to generate and send low-level random events to an Android application. Java Path FinderFootnote 17, a framework designed to analyze Java applications, has been used in five contributions to perform symbolic analysis of the Java source code of Android applications.
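To give a concrete idea of the GUI-level test cases that these libraries support, the following Robotium-based JUnit test is a minimal sketch: NoteActivity, the widget texts, and the expected confirmation message are hypothetical, and a standard Android instrumentation test project is assumed.

```java
import android.test.ActivityInstrumentationTestCase2;
import com.robotium.solo.Solo;

// Minimal Robotium sketch; NoteActivity and the GUI texts are hypothetical.
public class NoteActivityTest extends ActivityInstrumentationTestCase2<NoteActivity> {

    private Solo solo;

    public NoteActivityTest() {
        super(NoteActivity.class);
    }

    @Override
    protected void setUp() throws Exception {
        super.setUp();
        // Solo drives the GUI of the activity under test through the instrumentation
        solo = new Solo(getInstrumentation(), getActivity());
    }

    public void testSaveNote() {
        solo.enterText(0, "buy milk");          // type into the first EditText on screen
        solo.clickOnButton("Save");             // fire a GUI event, as the surveyed tools do programmatically
        assertTrue("Confirmation message not shown", solo.waitForText("Saved"));
    }

    @Override
    protected void tearDown() throws Exception {
        solo.finishOpenedActivities();
        super.tearDown();
    }
}
```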

4.8 RQ 2.6 Which mobile frameworks are the targets of the proposed techniques and tools?

On the basis of the collected data, the mobile framework targeted by most of the existing studies in the literature is the Android framework. In fact, 121 papers out of 131 propose techniques suitable for some Android framework versions.

iOS-based systems are rarely the object of specific studies in the literature, probably due to their proprietary nature, which hinders their diffusion in the academic community. In fact, we found in this study only a single paper proposing a testing technique applied to iOS applications (the recent one of Liu et al. 2017) and four other papers that propose techniques and tools applicable to both iOS and Android applications (Li et al. 2014a; Joorabchi et al. 2016; Gudmundsson et al. 2016; Zun et al. 2016). In addition, only three papers are focused on the less diffused Windows Phone framework (Liang et al. 2014; Ravindranath et al. 2014; Mayan et al. 2015). Six other papers are based on Symbian or J2ME (Delamaro et al. 2006; Jiang et al. 2007; She et al. 2009; Liu et al. 2010b; Nagowah and Sowamber 2012; Dev et al. 2012), but they have all been published before 2013, since these frameworks are clearly in a declining phase. Figure 6 shows the prevalence of Android-based contributions with respect to the ones based on the other platforms.

Fig. 6

Distribution of papers according to the targeted mobile frameworks

4.9 RQ 2.7 Are the proposed techniques and tools usable on emulators or real devices?

Unfortunately, the details given in the considered papers are not always sufficient to provide an answer to this question. In some cases, in fact, no information at all has been provided, while in other cases, this information can only be deduced from the description of the experiments reported in the evaluation section. On the basis of the available information, it appears that in 65 contributions the testing approach can be executed on emulators, in 40 cases it can be executed on real devices, and in the remaining 26 cases it can be executed on both real devices and emulators. In particular, the use of real devices enables some specific analyses involving different devices (e.g., Android TVs in Jiang et al. 2016), the verification of timing problems in capture and replay approaches (Gomez et al. 2013; Gomez et al. 2016; Ravindranath et al. 2014), and the detection of critical races (Hsiao et al. 2014; Maiya et al. 2014).

4.10 RQ 3.1 What are the characteristics of the performed evaluation studies?

In 45 papers out of 131, the validity of the proposed techniques and tools has been shown only by demonstrating their feasibility through examples of their use.

In the remaining 86 papers, experimental evaluations of the proposed techniques and tools have been carried out, and specific effectiveness metrics have been measured. The most frequently considered effectiveness metric is the number of failures that have been found (in particular crashes, exceptions, or any other failure revealed by the considered oracles): this metric has been evaluated in 54 different papers. Code coverage metrics (such as LOC coverage, method coverage, branch coverage, and activity coverage) have been measured, instead, in 44 different papers. In 18 of these papers, both evaluations have been carried out. More rarely, techniques and tools have been evaluated in terms of their ability to find injected faults (in six papers).

Figure 7 shows the distribution of papers according to the different considered evaluation methods.

Fig. 7

Distribution of papers according to the considered evaluation methods

4.11 RQ 3.2 What are the characteristics of the sets of applications used as objects of the evaluation experiments?

Most of the evaluation studies reported in the selected papers involve toy examples or real mobile applications. In only nine cases do the papers provide only a qualitative evaluation of the proposed techniques in terms of a description of the offered features, without any example or case study (Hesenius et al. 2014; Kaasila et al. 2012; Mendez-Porras et al. 2015a; Prathibhan et al. 2014; Reddy et al. 2016; Mirzaei et al. 2012; van der Merwe et al. 2012; Akanksha Ashok Magare 2016; Dutia et al. 2015).

In 30 of the remaining papers, the evaluation of the proposed techniques or tools is based only on toy examples, i.e., applications realized and proposed by the authors of the paper, while in all the other articles, case studies involving real mobile applications have been presented. In 13 papers, just a case study involving a single simple application is presented, while in another 44 papers, several applications (from 2 to 10) have been used to assess the effectiveness of the proposed contribution. In 25 of the remaining cases, a more significant evaluation is presented, based on a number of applications between 11 and 100. Finally, in 10 cases, massive experiments involving more than 100 applications have been presented.

Figure 8 shows the distribution of the papers with respect to the number of applications used as objects of the evaluation of the proposed techniques and tools.

Fig. 8

Distribution of papers according to the number of object applications

The tools that have been experimented on more than 100 applications are BBoxTester (Zhauniarovich et al. 2015), EventRacer (Bielik et al. 2015), PUMA (Hao et al. 2014), DroidMate (Jamrozik and Zeller 2016), Caiipa (Liang et al. 2014), JarJarBinks (Maji et al. 2012), Sapienz (Mao et al. 2016), SIG-Droid (Mirzaei and Heydarnoori 2015), Vanarsena (Ravindranath et al. 2014), and KREfinder (Shan et al. 2016). The experiments involving the largest numbers of applications are all related to black box testing techniques searching for crashes or other invariant oracles, since they can be carried out in a fully automated testing process.

Many of the experiments involving a number of applications between 10 and 100 are aimed at the evaluation of white box testing techniques. For example, the model learning techniques provided by Dynodroid (Machiry et al. 2013), MCrawlT (Salva and Laurencot 2015), and Crashscope (Moran K et al. 2016) have been tested on 50, 32, and 20 applications, respectively.

The applications under test selected for the experiments are usually real open-source applications (in 49 cases), often downloaded from F-DroidFootnote 18. In 32 cases, real applications published on an official market (usually the Google PlayFootnote 19 market for Android apps) have been considered. Open-source applications from F-Droid have generally been used for experiments involving source code coverage measures, while applications downloaded from Google Play have generally been used for black box testing approaches.

In some cases (in particular, in the experiments involving open-source applications), it is possible to estimate the complexity of the tested applications, since the authors have reported some size metrics (usually the number of LOCs of the applications under test). We have found this information in 28 papers, and we can observe that in only two papers very small applications (less than 1 kLOC on average) have been considered, while in 17 cases the experiments involved relatively small applications (between 1 and 10 kLOC on average). Only eight experiments involved medium-sized applications having more than 10 kLOC on average.

4.12 RQ 3.3 What are the characteristics of the performed comparative studies?

Experiments comparing the effectiveness of the proposed approaches with that of other existing tools represent a convincing way to evaluate the improvements of the considered approach with respect to the state of the art. Our study shows that they are not very common in this field.

In fact, only 39 papers out of 131 include at least one comparative study of the performance of the proposed testing approach.

The tool most frequently used as a baseline for comparisons is the Monkey tool (available in the Android framework), which can be executed in a completely automatic manner, almost without any configuration effort. It has been considered as a term of comparison in 26 different articles. Other tools considered for comparative evaluations are Dynodroid (used as a term of comparison in 10 papers), Android Ripper (in seven papers), and A3E and SwiftHand (in three papers). In only four papers have comparisons been presented between the effectiveness of the proposed testing tool and that of test cases designed by human testers (students or the authors of the paper).

Different attributes have been considered to compare the performance of the proposed approaches with existing ones: in particular, failure detection capability has been considered in 15 papers, while code coverage has been used in 23 other papers. Simple comparisons in terms of offered features have been performed in 14 papers. In particular, in eight papers (Zhauniarovich et al. 2015; Zhu et al. 2015; Machiry et al. 2013; Mao et al. 2016; Moran K et al. 2016; Nguyen et al. 2012; Qin et al. 2016a; Wu et al. 2016), comparative experiments have been carried out in which both failure detection capability and achieved code coverage have been considered.

The papers presenting the comparative experiments involving the largest number of different tools are the ones of Salva and Laurencot (2015) and of Moran K et al. (2016). In the paper of Salva and Laurencot, the performance of the proposed MCrawlT tool is compared with that of Monkey, Orbit, Guitar, AndroidRipper, SwiftHand, and Dynodroid in terms of offered features, achieved code coverage, and failure detection capability. In the paper of Moran et al., instead, there is a comparative study of the features offered by 20 different tools and an experiment comparing the failure detection capability of the proposed CrashScope tool with that of five other tools (A3E, Android Ripper, Dynodroid, PUMA, and Monkey). However, other interesting comparative studies can be found in secondary works, such as the recent ones of Choudhary et al. (2015) and Amalfitano et al. (2015b, 2017).

4.13 RQ 4.1 What is the number of published articles per year?

The evaluation of the article count per year has confirmed the relative youth of the mobile testing automation field and a growing interest in its findings. In fact, as shown in Fig. 9, there are very few papers before 2011. These papers (Delamaro et al. 2006; Jiang et al. 2007; She et al. 2009; Liu et al. 2010b) are all about J2ME or Symbian, while the first paper based on Android applications is by the same authors as one of the papers on Symbian (Liu et al. 2010a). Since 2011, probably due to the great success and diffusion of Android devices, there has been rapid growth in the number of articles on Android application testing (and, in a few cases, on testing of Windows Phone or iOS applications), which reached a peak in 2015 with 36 different papers. At the time of this study, eight papers had already been published in 2017.

Fig. 9

Distribution of the selected papers per year and per used frameworks

4.14 RQ 4.2 Which are the venues with the highest article counts?

The growing interest of the scientific community in this field is also witnessed by the number of papers published and presented at eminent, general-purpose conferences, such as the International Conference on Software Engineering (ICSE), which has hosted nine mobile testing automation papers in the last six years, and the ACM Conference on Object-Oriented Programming, Systems, Languages and Applications (OOPSLA), which has hosted five articles. Several other articles have been presented at other general software engineering conferences: four at the Symposium on the Foundations of Software Engineering (FSE), four at the Asia-Pacific Software Engineering Conference (APSEC), three at the Automated Software Engineering Conference (ASE), three at the International Computer Software and Applications Conference (COMPSAC), three at the ACM Symposium on Applied Computing (SAC), two at the Conference on Programming Language Design and Implementation (PLDI), and two at the Conference on Software Engineering and Knowledge Engineering (SEKE).

The two specific communities hosting mobile testing automation papers are the testing community and the mobile computing community. In the testing community, we have found six papers presented at the Workshop on Automation of Software Test (AST), five papers at the International Symposium on Software Testing and Analysis (ISSTA), five at the International Conference on Software Testing, Verification and Validation (ICST), three at the International Workshop on TESTing Techniques and Experimentation Benchmarks for Event-Driven Software (TESTBEDS), and two at the International Symposium on Software Reliability Engineering (ISSRE). In venues related to the mobile computing community, instead, there are, among others, six papers presented at the ACM International Conference on Mobile Software Engineering and Systems (MOBILESoft), two papers at the International Conference on Mobile Systems, Applications, and Services (Mobisys), two papers at the International Workshop on Mobile Development (Mobile!), and two papers at the International Workshop on Software Development Lifecycle for Mobile (DeMobile). Figure 10 shows, in the form of a histogram, the venues hosting at least two of the papers included in this study.

Fig. 10

Venues hosting two or more of the selected papers

Finally, in recent years, some of the most influential journals have published research on mobile testing automation: two papers in ACM Software Engineering Notes (van der Merwe et al. 2012; Mirzaei et al. 2012) and one paper each in the Journal of Systems and Software (Qin et al. 2016a), IEEE Software (Amalfitano et al. 2015d), the IEEE Transactions on Software Engineering (Lin et al. 2014), and the IEEE Transactions on Reliability (Jiang et al. 2016). In addition, two secondary papers related to mobile testing automation have been published in the Journal of Systems and Software (Zein et al. 2016; Amalfitano et al. 2017).

4.15 RQ 4.3 Which are the most influential articles in terms of citation counts?

An approximate analysis of the influence of the publications can be carried out by counting the overall number of articles citing the ones considered in this systematic mapping study. With reference to the count values provided by the Scopus search engine (which appears to be the most exhaustive search engine considered in this study), it is possible to observe that the most cited work is the one of Amalfitano et al. (2012b), presented at the IEEE/ACM International Conference on Automated Software Engineering (ASE) in 2012, with 156 citations up to September 2017. Other contributions by some of the same authors (Amalfitano et al. 2011, 2015d) are also among the most cited papers. Other works with more than 100 citations are the ones of Machiry et al. (141 citations), Anand et al. (2012) (112), Hu and Neamtiu (104), and Gomez et al. (101). Figure 11 shows the most cited papers considered in this study.

Fig. 11

The most cited papers (up to September 2017)

It is interesting to note that almost all the most cited contributions have a practical relevance, since they present open-source testing tools, which have sometimes been used as benchmarks in citing papers.

4.16 RQ 4.4 Who are the authors with the highest number of articles?

The authors with the highest number of papers are currently Amalfitano, Fasolino, and Tramontana, who work in the same group at the University of Naples and are authors of eight different papers, followed by Neamtiu (seven papers) and Linares-Vazquez (five papers). It is interesting to note that only a very limited number of collaborations involving authors from different groups has been observed.

4.17 RQ 4.5 Which countries have produced more articles?

The analysis of the countries of affiliation of the authors of the selected papers shows that this field of study is diffused worldwide, with contributions from all the continents. The countries with the highest numbers of contributions are the USA (38), China (26), Italy (10), and India (9). Figure 12 shows a world map in which the countries with more publications are filled with darker colors. It is also possible to observe that only 17 works have been written by authors from at least two different countries of affiliation.

Fig. 12

Map of the country of affiliation of the authors of the selected papers

4.18 RQ 4.6 What are the authors' affiliations?

A detailed analysis of the affiliations of the authors of the selected papers has revealed that only a few articles result from collaborations between authors from industry and authors from academia (17 papers), whereas most of the articles (115 papers) have been written only by authors from academia. No papers have been written only by authors from industry. The most active groups in this field are those of the University of Naples Federico II (8 papers), the College of William and Mary in the USA (6 papers), Nanjing University in China (6 papers), the University of California (5 papers), and George Mason University (4 papers). As regards industry, three papers are in collaboration with Microsoft Research and another three in collaboration with Fujitsu Laboratories. Figure 13 provides a geographical view of the distribution of the affiliations of the authors of the selected papers, where larger circles indicate institutions from which larger numbers of papers come.

Fig. 13

Geographical map of the affiliations of the authors of the selected papers

5 Threats to validity

Threats to the validity of a systematic mapping study are due to several possible aspects, such as the suitability of the categorization scheme, the recall and precision of study selection, the accuracy of data extraction, and the correctness of the conclusions. In this section, we discuss the main threats to the validity of this study and the actions that have been taken to mitigate them, in coherence with the classification of threats adopted by Garousi in some of his systematic mapping studies (e.g., in Garousi et al. 2013; Garousi and Mantyla 2016).

5.1 Threats to internal validity

Threats to internal validity can be caused by the process adopted to select the articles considered in this systematic mapping.

A first threat is related to the capability of the designed queries to find all the works in the literature describing techniques and tools supporting automation of functional testing of mobile applications. To this purpose, the set of keywords considered in the queries has been initially designed on the basis of the domain knowledge of the authors and of the keywords used in the most similar studies in the literature (in particular, the ones of Holl and Elberzhager (2016), Zein et al. (2016), Mendez-Porras et al. (2015b), and Sahinoglu et al. (2015)). Queries resulting from several combinations of keywords have been formulated, also taking into account the characteristics and the limitations imposed by the search engines. In order to validate the proposed queries, they have been preliminarily verified against a set of 55 publications that should certainly be included in the systematic mapping because of their coherence with the topic of the study, as described in Section 3.2.

A second threat is related to the evaluation of the inclusion and exclusion criteria. To this aim, the inclusion and exclusion criteria have been expressed so that they can be objectively evaluated, in order to mitigate the risk of arbitrary judgments by the authors of the study. In addition, each author labeled as “borderline” each paper for which he had doubts about its inclusion. The inclusion and exclusion of these papers have been jointly discussed by all the authors in order to reach a shared decision.

5.2 Threats to construct validity

Threats to construct validity are related to the suitability of the proposed RQs and of the attributes characterizing the categorization scheme. In order to limit this threat, the GQM approach has been used to preserve the traceability between research goals, questions, and metrics. The categorization scheme has been obtained in several steps. In the first step, a set of attributes and possible values has been jointly designed by all the authors. During the data extraction step, the authors have classified the papers with respect to the proposed categorization scheme, while remaining free to add other attributes and values that, in their opinion, better fit the characteristics of the analyzed papers with respect to the proposed research questions. Finally, the authors have restructured the categorization scheme taking into account the added attributes and values. We think that this process has mitigated the risk of neglecting relevant details of the analyzed papers.

Another threat is related to possible inaccuracies in the extracted data. To mitigate this threat, for each considered paper, the authors have labeled as borderline all the attributes for which they had doubts about their values. The values assigned to these attributes have been discussed and fixed after a final joint review involving all the authors.

In more detail, we have observed that some questions are more prone to being answered inaccurately. As regards RQ 2.5, in order to know whether each proposed tool is currently available on the web, we have considered the URLs declared in the papers and verified whether the tool is actually available at that address. In addition, for the tools for which no URLs have been reported in the paper, we have searched online, on the basis of the name of the tool, to check whether it is currently available. In this way, we have found that some tools that were not available at the time of the publication of the paper are now online, whereas some other tools are no longer available at the indicated URLs. It is possible that some tools have changed their names or have been merged into other tools or frameworks, so that we have not been able to find them. Of course, the list of available tools should be periodically updated in the online map. As regards RQ 2.7, many papers do not explicitly indicate whether the proposed techniques and tools are executable on real devices and/or emulators. In these cases, it was assumed that they can be executed on both. In some other papers, the answer to this question has been based on the experimental configuration declared in the evaluation section of the article. The number of citations needed to answer RQ 4.3 has been measured considering only Scopus, since it is the search engine providing the most complete view of the academic literature, as confirmed by the preliminary validation of the search queries. To this aim, we have discarded the measures provided by less inclusive search engines such as IEEExplore and ACM, as well as the ones provided by Google Scholar, which takes into account many sources, such as web sites and personal blogs, that are outside the scope of this study.

5.3 Threats to conclusion validity

In order to mitigate the possibility of giving incorrect answers to the proposed research questions, we have formulated our conclusions only on the basis of the analysis of the extracted data. The online availability of the extracted data makes it possible for other researchers to independently validate the correctness of the conclusions. The spreadsheet with all the extracted data can be freely downloaded and commented on by readers, and the authors will correct and update it periodically.

5.4 Threats to external validity

A first threat to external validity is related to the replicability of the study. To mitigate this threat, all the details needed to make an independent replication of the study possible have been reported in this paper.

Another threat is related to the generalization of the results of the study. The scope of this study is limited to the academic community, whereas its validity in the different contexts of the software industry has not been evaluated. As regards the academic community, the completeness of the study is guaranteed by the set of considered search engines, which includes all those commonly used by software engineering researchers and by similar systematic mapping studies. On the other hand, the proposed strategy for study selection cannot be extended to the industrial context, where contributions may be found as blog posts, pages on commercial web sites, presentations, videos, and other forms. As recently shown by Garousi and Mantyla (2016), a multivocal literature review (MLR) could be the way to extend the scope of an academic systematic mapping study to the industrial world.

6 Discussion

This section presents a possible approach for selecting papers from the systematic mapping by means of a ranking metric and discusses emerging trends and current research gaps in the field of automated functional testing of mobile apps.

6.1 Extracting relevant papers from the systematic mapping

As already reported in Section 1, Petersen et al. (2008, 2015) specify that a systematic mapping provides a classification scheme and structures a software engineering field of interest. In order to show how different facets of this scheme can be combined to answer more specific research questions, we now present a possible approach for article selection that reflects a reader's specific research interests.

To reach his specific objective, a reader has to select, from the indicators measured in the systematic mapping study, the ones he considers relevant for his specific research interests. Then, the reader has to design a scoring function that assigns each indicator a score. Each paper is then ranked by means of a ranking metric that aggregates the single indicator scores. Finally, the reader can select the papers that reach specific score values.

For example, a reader may be interested in selecting only articles that are characterized by a strong academic relevance in the literature and that present techniques/tools having a strong practical relevance. To this aim, he may select the six indicators reported below and design a corresponding scoring function for each of them.

A simple scoring function has been considered, which assigns each paper a score of 0 or 1 for each indicator. An exception is represented by indicator C4, for which the scoring function assigns one of three possible scores: 0, 0.5, or 1.

The six considered indicators and the corresponding scoring functions are the following:

  • C1   The editorial relevance of the journal in which the paper has been published or of the conference at which the paper has been presented. Well-known ranking systems have been considered, i.e., ScimagoFootnote 20 for journals and the Core Rankings PortalFootnote 21 for conferences. One point has been assigned to papers published in journals belonging to the quartiles Q1 and Q2 according to Scimago and to papers presented at rank A or A* conferences according to Core Ranking, zero points otherwise.

  • C2   The number of citations to the paper, already presented discussing RQ 4.3. One point has been assigned to papers reaching at least the threshold of 50 citations, zero points otherwise.

  • C3   The availability of the tool. One point has been assigned to papers describing publicly available testing tools, zero points otherwise.

  • C4   The approach used to evaluate the effectiveness of the proposed testing techniques/tools. It has been checked whether, in the selected papers, the effectiveness has been evaluated by means of coverage metrics (e.g., code coverage or model coverage) or by counting the number of detected failures/faults. 0.5 points have been assigned to papers in which the effectiveness has been assessed only by means of coverage measures and 0.5 points to papers reporting only the number of detected failures or faults. One point has been assigned to papers presenting both these approaches, and zero points to papers that do not provide any empirical validation of the proposed technique/tool.

  • C5   The size of the application sample involved in the possible empirical study reported in the paper. One point has been assigned to papers reporting the results of studies involving at least 10 applications, zero points otherwise.

  • C6   The existence of empirical comparisons between the techniques/tools proposed in the paper and the state of the art. One point has been assigned to papers reporting techniques that have been empirically compared against the state of the art (in the same paper or in other selected papers), zero points otherwise.

The total score of each paper can range between 0 and 6: the greater the value, the greater the relevance of the paper.
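Computationally, the metric is a simple sum of the six indicator scores. The following minimal sketch (the class, the method names, and the example indicator values are ours and purely illustrative, not taken from any specific paper in the map) shows the arithmetic:

```java
// Minimal sketch of the ranking metric described above; the indicator inputs are illustrative.
public class PaperScore {

    static double score(boolean topVenue, int citations, boolean toolAvailable,
                        boolean coverageEvaluated, boolean failuresEvaluated,
                        int applicationsInStudy, boolean comparedToStateOfTheArt) {
        double c1 = topVenue ? 1 : 0;                   // Scimago Q1/Q2 journal or CORE A/A* conference
        double c2 = citations >= 50 ? 1 : 0;            // citation threshold
        double c3 = toolAvailable ? 1 : 0;              // publicly available tool
        double c4 = (coverageEvaluated ? 0.5 : 0)
                  + (failuresEvaluated ? 0.5 : 0);      // evaluation by coverage and/or detected failures
        double c5 = applicationsInStudy >= 10 ? 1 : 0;  // size of the application sample
        double c6 = comparedToStateOfTheArt ? 1 : 0;    // empirical comparison with the state of the art
        return c1 + c2 + c3 + c4 + c5 + c6;             // total score in [0, 6]
    }

    public static void main(String[] args) {
        // Hypothetical paper: A-ranked venue, 141 citations, open-source tool,
        // evaluated via both coverage and failures on 50 apps, compared against a baseline tool.
        System.out.println(score(true, 141, true, true, true, 50, true));   // prints 6.0
    }
}
```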

Of course, we are aware that the selection of the indicators and of the corresponding scoring functions can influence the results of the evaluation and could produce different rankings of the analyzed papers. In particular, with respect to the proposed ranking metric, recent papers are penalized by the number of citations, while older tools may not be involved in comparative studies due to the evolution of the mobile execution environments. The scoring functions and the ranking metric have been directly computed in the last columns of the spreadsheet available at https://goo.gl/678T5P and can easily be modified by readers by editing the formulas reported in that spreadsheet.

The 12 papers that have reached a score of at least 4 points are reported in Table 6, while the complete ranking can be seen in the last column of the online spreadsheet.

Table 6 Most relevant papers according to the proposed ranking metric

According to the proposed ranking, the most relevant paper is the one of Machiry et al. (2013), which presents the publicly available tool Dynodroid. Dynodroid has demonstrated its ability to automatically test Android applications with both model-learning and random-based techniques, reaching a good code coverage level and finding real failures. Its testing adequacy has been shown in an empirical study involving 50 applications, in which it has been compared with Monkey, and it has often been considered as a benchmark for the other tools developed since 2013.

The second most relevant paper is the one of Azim and Neamtiu (2013), which presents A3E, another tool able to automatically explore and test Android applications. A3E has demonstrated its testing adequacy in terms of code coverage in an experiment involving 25 applications and has been used as a benchmark by other studies, which have often obtained better performance. A3E has been publicly available since 2013.

Another very relevant paper is the one of Mao et al. (2016), which presents the Sapienz tool. Sapienz is able to automatically explore and test Android applications, outperforming Dynodroid and other tools available in 2016 both in terms of code coverage and capability of finding real software failures. The improvements introduced by Sapienz are essentially related to its effective combination of random, systematic, and search-based exploration techniques. Sapienz is publicly available but no longer maintained (as stated on the tool web site).

Another tool able to explore Android applications in a completely automatic way, with different possible random and systematic techniques, is Android Ripper, presented by Amalfitano et al. in a series of papers since 2011 (Amalfitano et al. 2011, 2012a, 2012b, 2015d). The effectiveness of the tool in terms of achieved code coverage has been assessed by different experiments involving Android applications, and this tool has been used as a benchmark in many other studies, which have sometimes surpassed its performance. The tool is publicly available and maintained at the time of this research.

The paper of Choi et al. (2013) presents SwiftHand, another tool for the automatic exploration and testing of Android applications that is publicly available and has been maintained up to 2015. It, too, has been used as a benchmark by other studies (Mao et al. 2016).

Other very relevant papers present contributions related to the automation of more specific testing activities. The papers of Gomez et al. (2013) and Hu et al. (2015) have presented RERAN and VALERA, two record-and-replay tools able to extract and analyze the sequences of events corresponding to user interactions with an Android application and to generate executable test cases able to reproduce these interactions with high fidelity. The RERAN tool is currently publicly available and represents a very useful means to automatically generate test cases from user sessions.

The paper of Maiya et al. (2014), instead, presents DroidRacer, which is devoted to the automatic detection of concurrency bugs in Android applications. The DroidRacer tool has been capable of finding many real bugs and is currently available and maintained. The recent paper of Shan et al. (2016) presents KREFinder, a tool able to find bugs due to the incorrect management of the resume and restart of Android applications on the basis of information extracted via static analysis. KREFinder is publicly available and has been maintained up to 2016.

Finally, the paper of Hao et al. (2014) presents a tool called PUMA and a language to configure it (PUMAScript), which allow the implementation of many different testing and quality assessment tasks on Android applications. The tool is publicly available and currently maintained. It is the most flexible of the presented tools, since it has been applied to several different testing activities.

To the best of our knowledge, the actual relevance of all the selected papers can be confirmed, so we are confident about the usefulness of the proposed metric.

6.2 Focus on GUI-based testing approaches for Android applications

Almost all the works found by this study concern techniques and tools applicable to the Android framework (about 92% of the total, as shown by the answer to RQ 2.6), with very few works focusing on other popular frameworks like iOS. We think that the reason for this polarization is related to the open-source nature of most of the Android tools, which makes it possible for researchers to realize free testing tools and to share them freely for academic purposes.

In addition, most of the proposed approaches tackle the problem of testing by executing events on the GUI of the application under test. The reason why GUI-based testing techniques are so popular may be the availability, since the earliest versions of Android, of libraries supporting GUI testing via JUnit test cases (the fundamental InstrumentationTestCase library was released with the first version of Android in 2008).

The Android framework also provides tools and libraries allowing low-level interactions with the applications under test, such as MonkeyRunner, or other basic system tools such as sendEvent and getEvent, through which the event stream of the application under test can be accessed. They have rarely been exploited by the tools found in the literature: a single contribution is based on MonkeyRunner (Dutia et al. 2015), while sendEvent and getEvent are the basis for the capture-and-replay tools RERAN (Gomez et al. 2013) and VALERA (Hu et al. 2015).

The low-level testing tool most often used by the tools retrieved in this study is Monkey, which automatically generates random events and sends them to the device by means of the sendEvent tool. No contributions at all have been found regarding the testing of native code components developed in C++ using the Android NDK. Moreover, no specific approaches focused on the automatic testing of components of Android applications such as services, broadcast receivers, and content providers have been found in the literature.
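For reference, exploiting Monkey essentially amounts to invoking it through adb; the following minimal sketch wraps such an invocation in Java (the package name is hypothetical, and adb is assumed to be on the PATH).

```java
import java.io.IOException;

// Minimal sketch: launching Monkey against a hypothetical package through adb.
public class MonkeyLauncher {
    public static void main(String[] args) throws IOException, InterruptedException {
        Process monkey = new ProcessBuilder(
                "adb", "shell", "monkey",
                "-p", "com.example.app",   // hypothetical package name of the application under test
                "-s", "42",                // fixed seed, so the random event sequence is reproducible
                "-v",                      // verbose logging of the generated events
                "500")                     // number of random low-level events to generate
                .inheritIO()               // forward Monkey's output to the console
                .start();
        System.exit(monkey.waitFor());
    }
}
```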

Other testing issues for which few contributions have been found in literature are context-aware testing and concurrency testing.

Context-aware testing is an important issue for mobile applications due to the large availability of sources of contextual events on mobile devices (i.e., sensors, Internet connection, background services, etc.). Despite this, only 12 of the considered papers take these events into account (i.e., Amalfitano et al. 2013a; Gomez et al. 2013; Hu et al. 2015; Liang et al. 2014; Griebe and Gruhn et al. 2014, 2015; Adamsen et al. 2015; Song et al. 2015; Hu and Neamtiu 2016; Qin et al. 2016a; Yu and Takada 2016; Arnatovich et al. 2016). On the other hand, the interest in this field appears to be growing, since most of these publications have appeared in the last 2 years. The recent studies of de Sousa Santos et al. (2017) and Matalonga et al. (2017) have further highlighted this gap in the current literature.

Concurrency races are the cause of many failures in mobile applications. Although support for concurrency was quite limited in the first Android versions, extensive support for developing concurrent Android applications, including threads and asynchronous tasks, is now available to developers. On the other hand, very few approaches devoted to the detection of concurrency races from the tester's point of view have been found in this study (only five papers (Hsiao et al. 2014; Maiya et al. 2014; Tang et al. 2016; Hu et al. 2016; Li et al. 2016b)).

6.3 Scarce attention to fault modeling and finding

As observed in answering RQ 3.1, the two most frequent testing targets of the papers included in this study are code coverage and failure detection, whereas bug finding and fixing have received very limited attention. In fact, the objective of more than 50 approaches is to find failures of mobile applications, including crashes, unhandled exceptions, concurrency races, and context-aware issues. On the other hand, only a few papers have attempted to find them by characterizing application faults. Only a single approach focuses on bug localization (the one of Machado et al. 2013), and only one other uses historical bug information from bug repositories to identify new bugs (Mendez-Porras et al. 2015a).

6.4 Distance between industry and academia

In our systematic mapping, we have found a very limited number of contributions from industry or from collaborations between industry and academia. This can be due to the strategies followed by many companies, which do not publish freely available testing tools. For example, the first three tools from academia that have been the basis for commercial projects (i.e., Testdroid (Kaasila et al. 2012), Caiipa (Liang et al. 2014), and MZoltar (Machado et al. 2013)) are no longer freely available, and no other academic publications regarding their evolution can be found in the literature. Another example is Google, which has released several testing services for Android applications in recent years (e.g., the Espresso library, the Firebase Test Lab cloud environment, and the Android Robo Test tool for automatic testing) but no academic publications demonstrating their effectiveness. On the other hand, information about these tools can be found in other forms, such as tutorials on the Android Developer web siteFootnote 22 or videos from the Google I/O eventsFootnote 23.

In addition, in the academic papers, we have not found any validation experiment involving the testing of real industrial applications during their development, but only experiments involving black box testing of already published industrial applications (Zeng et al. 2016). This observation confirms the conclusion highlighted in the systematic mapping of Zein et al. (2016) about the absence of case studies on large commercial applications during their life cycle. On the other hand, differently from other observations reported in Zein et al. (2016), the answer to RQ 3.2 shows that a relevant number of testing techniques and tools have now been evaluated on large sets of real applications available on public markets.

6.5 Comparative studies and testing benchmarks

The analysis of both primary studies considered in the systematic mapping and secondary studies described in Section 2 has shown that there is a relative lack of papers addressing comparison experiments involving different techniques and tools for mobile testing automation, but that this is an emerging topic.

Comparative experiments are included in few papers, and the experiments that have been carried out present many limitations in terms of replicability and generalization of the conclusions. We have investigated the issues that make a fair comparison between the available testing tools difficult. A first issue is the rapid obsolescence of the academic tools available in the literature, which is primarily due to the very rapid evolution of the Android framework and of its supporting development environment, making continuous evolutionary maintenance of the testing tools necessary.

For this reason, it is difficult to design a test harness able to compare different testing tools in the same environment, and it is also difficult to select a set of applications on which all the tools under comparison can be executed.

In addition, most of the available testing tools are released as prototypes that do not allow all their characteristics to be customized. For example, in many tools, it is not possible to set preconditions on the applications under test.

Several recent works have tried to face these issues. The recent contribution of Choudhary et al. (2015) represents the best attempt to perform a fair comparison between the available tools. They realized a test harness able to run different tools on the same machine and under the same execution conditions. To this aim, they contacted the authors of the tools published in the literature up to 2014 and, with their collaboration, modified almost all the available open-source tools in order to execute them in a common Linux-based environment. They reported that they were able to include only seven tools in their experimentation: Monkey, Dynodroid (Machiry et al. 2013), Android Ripper (Amalfitano et al. 2012b), A3E (Azim and Neamtiu 2013), SwiftHand (Choi et al. 2013), PUMA (Hao et al. 2014), and ACTeve (Anand et al. 2012). Due to the rapid evolution of the Android framework, it is also very difficult to establish a common benchmark of applications. For example, Choudhary et al. (2015) attempted to form a testing benchmark by considering a set of applications that had previously been tested in the papers presenting the tools under comparison. They collected 68 applications, but admitted that only 51 of them turned out to be executable with each of the seven considered testing tools.

A different approach has been followed by Amalfitano et al. (2017), who have focused their attention on testing techniques, comparing the performance of several active learning and random techniques in the same testing environment.

The problem of comparing the performance of academic tools with that of commercial tools is still open, since only the open-source tool Monkey has been involved in comparative experiments. For example, there are no experiments comparing the performance of commercial tools such as Android Robo TestFootnote 24 from Google or capture-and-replay tools such as Robotium RecorderFootnote 25, Espresso Test RecorderFootnote 26, and Ranorex Android Test AutomationFootnote 27. In the same way, there are no papers presenting comparisons with frameworks for testing iOS applications, such as Earl GreyFootnote 28 and FrankFootnote 29.

6.6 Absence of specific venues and journals focused on mobile testing automation

A relevant number of papers focused on mobile testing automation has been found in the literature, particularly in recent years, as shown in Fig. 9. In the recent past, specific sessions of scientific conferences have sometimes been centered on aspects of mobile testing automation (for example, the sessions called “UI Automation” at MobiSys 2014, “Quality Assurance” at Mobilesoft 2015, “Mobile GUI” at ISSRE 2015, “Android” at ISSTA 2016, and “Testing Smartphone Applications” at AST 2016). In addition, sessions related to testing mobile applications have often been hosted by industrial events, such as the Google Testing Automation Conference (GTAC)Footnote 30, that do not publish papers indexed by the considered search engines. Although this shows a good level of interest from the scientific community, no specific venues have so far been dedicated to mobile testing automation or, more generally, to mobile testing. Analogously, no special issues on this topic have yet been published in international journals. These considerations appear to be evidence of the current absence of a cohesive community of scientists involved in this topic.

7 Conclusions

The systematic mapping study presented in this paper provides a panorama of the state of the art of the scientific literature in the specific field of the automation of functional testing of mobile applications. This study presents a classification of the contributions provided by a set of 131 papers in the literature, selected by applying an accurate strategy based on validated search queries and on the application of a set of inclusion and exclusion criteria. The systematic mapping study has been guided by 4 main goals and a total of 18 different research questions.

This study represents an advancement with respect to the existing secondary studies in the literature focused on the specific field of automation of functional testing of mobile applications. It can be a useful tool for researchers, students, and practitioners to obtain an overall and detailed view of the state of the scientific literature on the considered topic. For validation purposes, all the information necessary to replicate the study has been reported in this paper, and all the extracted data are available online at https://goo.gl/678T5P in the form of a Google spreadsheet. The online spreadsheet can be freely downloaded or commented on by readers, who can also propose edits to the authors.

The analysis of the systematic map has allowed the identification of some research trends and gaps in this research area. In particular, this study has found that there are few contributions from industry and that there is a lack of contributions on specific topics such as techniques and tools for testing iOS applications, testing tools based on Android Espresso, and techniques aimed at testing the C++-based components of mobile applications. In addition, limited attention to some topics, including context-aware testing, concurrency testing, and fault detection, has been revealed. Finally, the analysis of the bibliography has shown the absence of specific venues and journals focused on mobile testing automation. We are confident that these gaps can be filled by researchers in the next years.