Challenges and pitfalls on surveying evidence in the software engineering technical literature: an exploratory study with novices

Ribeiro, Talita Vieira; Massollar, Jobson; Travassos, Guilherme Horta

doi:10.1007/s10664-017-9556-7

Challenges and pitfalls on surveying evidence in the software engineering technical literature: an exploratory study with novices

Published: 28 October 2017

Volume 23, pages 1594–1663, (2018)
Cite this article

Download PDF

Access provided by Autonomous University of Puebla

Empirical Software Engineering Aims and scope Submit manuscript

Challenges and pitfalls on surveying evidence in the software engineering technical literature: an exploratory study with novices

Download PDF

Talita Vieira Ribeiro¹,
Jobson Massollar¹ &
Guilherme Horta Travassos¹

834 Accesses
9 Citations
2 Altmetric
Explore all metrics

Abstract

The evidence-based software engineering approach advocates the use of evidence from empirical studies to support the decisions on the adoption of software technologies by practitioners in the software industry. To this end, many guidelines have been proposed to contribute to the execution and repeatability of literature reviews, and to the confidence of their results, especially regarding systematic literature reviews (SLR). To investigate similarities and differences, and to characterize the challenges and pitfalls of the planning and generated results of SLR research protocols dealing with the same research question and performed by similar teams of novice researchers in the context of the software engineering field. We qualitatively compared (using Jaccard and Kappa coefficients) and evaluated (using DARE) same goal SLR research protocols and outcomes undertaken by similar research teams. Seven similar SLR protocols regarding quality attributes for use cases executed in 2010 and 2012 enabled us to observe unexpected differences in their planning and execution. Even when the participants reached some agreement in the planning, the outcomes were different. The research protocols and reports allowed us to observe six challenges contributing to the divergences in the results: researchers’ inexperience in the topic, researchers’ inexperience in the method, lack of clearness and completeness of the papers, lack of a common terminology regarding the problem domain, lack of research verification procedures, and lack of commitment to the SLR. According to our findings, it is not possible to rely on results of SLRs performed by novices. Also, similarities at a starting or intermediate step during different SLR executions may not directly translate to the next steps, since non-explicit information might entail differences in the outcomes, hampering the repeatability and confidence of the SLR process and results. Although we do have expectations that the presence and follow-up of a senior researcher can contribute to increasing SLRs’ repeatability, this conclusion can only be drawn upon the existence of additional studies on this topic. Yet, systematic planning, transparency of decisions and verification procedures are key factors to guarantee the reliability of SLRs.

On the pragmatic design of literature studies in software engineering: an experience-based guideline

Article 06 January 2017

Practical relevance of software engineering research: synthesizing the community’s voice

Article Open access 05 March 2020

Benefitting from the Grey Literature in Software Engineering Research

Discover the latest articles, news and stories from top researchers in related subjects.

Artificial Intelligence

Use our pre-submission checklist

Avoid common mistakes on your manuscript.

1 Introduction

Literature reviews serve as common starting points for most scientific research, including research in the Software Engineering (SE) field. Finding and reviewing previous studies or software technologies provides benefits for researchers regarding the identification of i) knowledge and new ideas about a topic; ii) research gaps and opportunities; and iii) related work. In industrial software scenarios, practitioners can take advantage of literature reviews to support the searching for software methods, processes, techniques, and tools, among other instruments suitable for their development contexts, which lower the risk of incorrect adoption decisions in their software development settings. However, ad-hoc literature reviews can threaten their own replication, coverage, and fairness, among other features.

Systematic literature reviews (SLRs) represent a more procedural and rigorous strategy to perform literature reviews. They define a set of steps to guide the scientific literature search, producing a repeatable research protocol, allowing critical judgment about the quality of the obtained knowledge and reducing bias related to outcomes (Biolchini et al. 2005; Kitchenham and Charters 2007). Quasi-systematic literature reviews (Travassos et al. 2008) and systematic mapping studies (Petersen et al. 2008) are also types of SLRs. The former does not support meta-analysis due to the lack of a baseline (comparison) for evidence aggregation. The latter focus on providing an overview of an area of interest, rather than aggregating evidence for a specific purpose.

As an investigation tool, the SLR strategy play a major role in the context of evidence-based software engineering (EBSE), which aims at providing an efficient way to integrate current scientific evidence with practical experience to support the decision making in SE (Dybå et al. 2005). SLR’s methodical processes for gathering, extraction, evaluation, and aggregation of evidence from various studies can assist the researchers in organizing a relevant and reliable body of knowledge regarding a specific research topic in academia. They can also assist the practitioners in finding software technologies suitable for their particular scenarios of software development in industry. As an example of the latter, Siemens Corporate Research supported the execution of an SLR aiming at investigating model-based software testing approaches (Dias Neto et al. 2007). Other examples of SLRs involving the industry can be seen in (Kasoju et al. 2013; López et al. 2015; Ulziit et al. 2015) and (Garousi et al. 2016) among others.

The importance and expected benefits of SLRs justify the concerns regarding their quality, since the topic under investigation; the experience of researchers or practitioners in both the research method and the topic; and the means and knowledge supporting the research questions answering can compromise the results (MacDonell et al. 2010). Therefore, some guidelines to undertake SLRs have been proposed over the years, such as (Biolchini et al. 2005; Kitchenham and Charters 2007; Petersen et al. 2015) and (Kuhrmann et al. 2017). Such guidelines provide recommendations aiming at reducing threats to the validity of SLRs by advising researchers and practitioners to explain the need for the SLR and to detail the research objectives and the plan that will support the study execution. Also, many investigations concerning the planning and execution of SLRs have been published in the technical literature. In this regard, some authors claim that SLRs are robust enough to resist execution deviations, producing stable outcomes for different processes (MacDonell et al. 2010). Notwithstanding, various researchers observed incompatibilities in results in SLRs with similar goals but executed by independent investigators (Kitchenham et al. 2011, 2012; Wohlin et al. 2013) and (Munir et al. 2014) – more details in Section 2.

In this context, at an International Software Engineering Research Network (ISERN) held in 2009, a group of ISERN members raised concerns regarding the possibility of conflicting results in SLRs. At that time, they were discussing the first SLR results in the SE field. They assumed that since an SLR protocol is supposed to be explicit, precise and unbiased, its outcomes should be either equal or alike to other results obtained by other researchers or practitioners executing (replicating) it or working with SLR protocols with similar purposes. It was stressed that knowledge and experience in the method play a major role in SLR planning and execution and they could lead to differences in the outcomes, indicating that an SLR might not be suitable for those players inexperienced in the method.

Out of these discussion arise a question: if the technical literature reports inconsistencies regarding SLRs executed by novice and even expert researchers, and the EBSE relies on research-based evidence through SLRs, how can SLRs conducted by practitioners – which usually are not much acquainted with this research method – be considered reliable? This way, our aim is to discuss SLRs reliability based on the following statement: “Similar SLR protocols, executed by similar teams of novice researchers, lead to equivalent answers (outcomes) to the same research question.” It is important to note that to some extent, graduate students (novice researchers) can present similar skills to practitioners, especially the less experienced ones, concerning planning and executing SLRs. About domain knowledge, practitioners may even be considered more experienced, but eventual differences in SE terminology adopted in the industry and academia can bring some difficulties to practitioners regarding SLR planning. That is, the domain knowledge may be insufficient to figure out adequate terms associated with a specific research question. A set of investigation questions were posed aiming at observing the statement above: What will happen if balanced groups of novice researchers (regarding their knowledge and experience in SLR planning and execution, and also in the research topic) plan and execute an SLR for the same research question? Should the research protocols be similar to each other, given that they address the same research question? Once similar SLR protocols are planned, should the selection of studies and reported outcomes be equal to each other, given the repeatable characteristic of SLRs? What do the differences between the planned SLRs and their results tell us about reliability (process repeatability and outcomes consistency)? How do players’ (lack of) knowledge and experience affect the SLRs reliability?

To investigate these questions, we planned and accomplished an exploratory study (detailed in Section 3) to analyze the planning, execution and outcomes of seven quasi-SLRs carried out by novice researchers (master and doctoral students) in the context of an Experimental Software Engineering (ESE) course in two distinct years – 2010 and 2012. The results presented in Section 4 indicate that i) when the same research question is addressed, different quasi-SLR protocols are planned; ii) when a similar point of view for the studies’ selection strategies is reported, divergent studies are selected; and iii) when the selected studies are the same, independent teams report different results. These discrepancies reinforce the perception that the difficulties faced by novice researchers in the planning and execution of SLRs impact the approach reliability and repeatability. Based on that, we can question whether the proposed and used guidelines by the academics to carry out SLRs are feasible to support novices in the context of academia and practitioners in the industry as well.

The remainder of this paper is organized as follows. In Section 5 we present the quasi-SLRs scores concerning the research protocols and reports as a way to identify the main issues faced by the participants while performing the assignment. Next, in Section 6 we discuss the challenges on surveying SE evidence with novices and the strategies they can adopt to make the SLRs suitable for those inexperienced in the method and in the topic under investigation, such as practitioners (especially concerning the former). The threats to the validity of this study are in Section 7 and the Conclusions in Section 8.

2 Related Works

Several studies report on the use of novice researchers performing SLRs in SE, and even though a couple of studies mention novice researchers can undertake SLRs, they represent one of the causes for results instability in SLRs. Definition of research questions; inclusion and exclusion criteria; and data extraction and synthesis are among the main difficulties faced by novice researchers while surveying evidence in the technical literature. However, difficulties in conducting systematic reviews can also be found when expert researchers conduct them. The next subsections provide an overview of different related works that i) used students to evaluate the applicability or reliability of SLRs; ii) compared independently published literature reviews and used feedback from experts to assess the research method quality and also the barriers encountered during its execution. A summary of their results is highlighted since we used some of them to support the planning of the exploratory study presented in this paper.

2.1 SLRs and Novices

In 2006, Rainer, Hall, and Badoo presented a preliminary investigation on undergraduate students’ experiences of using the EBSE approach while evaluating software technologies (Rainer et al. 2006). Overall, students had problems constructing EBSE questions, and they mainly based their questions on topics they had some experience with, for instance, programming languages to be used in their undergraduate assignments. One of their main difficulties was to formulate a question comparing software technologies. For example, the students formulated exploratory questions to identify all programming languages they could choose for their assignments, rather than developing a question to compare programming languages they were in doubt of choosing. The sources selected for collecting information to support their answers were not as expected since little scientific production was used to support their searching for information. Also, the students provided poor explanations concerning their search process, and they made different use of the available guidelines.

Oates and Capper tried to overcome some of the issues observed by Rainer, Hall, and Badoo. They carried out what they called a case study trying to answer questions related to the EBSE approach concerning its use by students (Oates and Capper 2009). They asked students to conduct an SLR on a topic of their interest and write a short essay on their experiences with the EBSE approach. The authors made some restrictions, though: they had given a question for the students to start working with; they had advised the students to search in scientific databases and to refine their search until they reached a set of articles in the range of 10 to 30. The analysis of the students’ marks supported the authors’ assumption that students could perform SLRs – at least upon the restrictions and guidance previously stated. The authors noticed that students need a more iterative surveying process in which they can refine their search strategy until they find works relevant to answering their research questions.

Even though Oates and Capper stated that students could perform SLRs, according to Riaz et al. their experience in conducting a complete systematic search for evidence can be quite different from experts (Riaz et al. 2010). In a study to gather the main difficulties faced by students while conducting SLRs, the authors identified issues related to building a search string that would retrieve a considerable number of papers without returning much noise; selecting appropriate works based solely on title and abstract; extracting the right amount of information from the selected works; synthesizing data that was not easily comparable; among others. While defining the research question can be challenging to both novices and experts, overall the former group faced more difficulties than the latter one.

Brereton could identify positive results in a study involving students conducting SLRs (Brereton 2011). In her case study, she observed that students were successful in undertaking most of the steps of the SLR process. The students’ performance was based on marks to their activities, and, in general, students with lower marks had problems with separating the planning information from the execution information. In summary, students succeed more in planning activities, even though they mentioned that the planning phase was the most difficult part of the SLR process.

Although all these studies concluded that students could be used to perform SLRs in ES, even though they have more difficulties than experts in performing the search, there are still some issues related to SLR completeness and repeatability that they did not evaluate. Kitchenham et al. (2011) presented a case study conducted to investigate the repeatability of the results provided by SLRs. Two research assistants (RAs) planned and conducted the same SLR topic, and their results were compared to each other. Their results were also compared with a previously published literature review on the same subject conducted by experienced researchers. Even though the same search period and libraries were used for all three SLRs, they reported different sets of primary studies for the same research topic. Kitchenham et al. conjectured that the lack of experience in the research topic and in the method, and the application of the inclusion/exclusion criteria can be the reasons for these differences.

More recently Carver et al. identified barriers to the SLR process (Carver et al. 2013). The authors gathered data from their experiences conducting SLRs, as well as from feedback of graduate students in an SLR course, and from authors of published SLRs. Among the most difficult tasks of the process are the ones related to selecting papers, extracting data and assessing the studies quality, and the most time-consuming tasks are the ones related to searching databases, choosing papers and extracting data. The authors’ findings suggest the need for careful SLR planning, especially concerning scoping the research questions, and defining the inclusion/exclusion criteria. Also, their study emphasizes the need for reviewing the whole planning (by experts) as well as taking advantage of teamwork to minimize bias and conflict resolution.

2.2 SLRs and Experts

Issues involving the use of SLRs in SE are not exclusive of students’ participation. In 2009 Babar and Zhang performed an interview-based survey to identify the perceptions of research practitioners on conducting SLRs in the SE field (Babar and Zhang 2009). The authors selected 24 researchers identified as active practitioners in SLR executions from which 17 agreed to respond to their interview. Apart from the positive perceptions regarding the research method, the researchers reported some of the most challenging things in SLRs which included the effort involved in the whole process, the design of search strings, and the definition of research questions.

More aligned with the work of assessing the reliability of SLRs, MacDonell et al. (2010) investigated the consistency of the SLR process and the stability of its outcomes. Their study compared the results of two independent reviews (performed by groups with similar domain experience) undertaken with a common research question. In comparison to the work presented by Kitchenham et al. (2011), the reviewers have vast experience in the research topic (cross-company estimation models and within-company estimation models). Although the two groups conducted the SLR in different ways (search strings, review process), the findings were similar (from 11 primary studies identified by the two groups, nine were commonly identified). The main causes of the differences have been designated as: i) a lack of consensus on what constitutes a high-quality primary study, and; ii) misunderstandings as to what constitutes an appropriate response variable. The conclusion of the study indicates the robustness of SLR as a research method (considering groups with similar domain experience), although its repeatability can be compromised.

In a participant-observer case study, Kitchenham et al. (2012) performed a mapping study of unit testing and regression testing to investigate the completeness of general mapping studies. They compared it with other specific mapping studies, SLRs and an expert literature review aiming at investigating how well general mapping studies identify clusters of related studies and to what extent such clusters are complete. The authors identified differences between the general systematic mapping they performed and the expert literature review regarding included papers, showing their mapping study outperformed the expert review. They also found that in comparison to SLRs and more accurate mapping studies, general mappings can miss important and relevant works. During the comparison, the authors identified issues related to differences in the classification of selected studies between the literature reviews, and also inconsistencies in the selection of studies which led the authors to advise the use of clear explanations during the exclusion of studies.

Another work that compared the results from independent literature reviews is the one presented by Wohlin et al. (2013). The authors present a study about two systematic mapping studies on the same research topic aiming at evaluating their reliability. Although the two studies address the same research topic, significant differences were identified regarding the inclusion and the categorization of papers, indicating low similarity between them. Based on that, the paper presents four conjectures to be confirmed or rejected through future investigations: i) snowballing based on researcher expertise and knowledge of an area is more efficient than trying to find optimal search strings; ii) secondary studies will not find the same papers unless it is a study of a relatively narrow area with experts in the area conducting the study; iii) secondary studies may come to the same general conclusions regarding an area even if the papers found are not the same, and; iv) secondary studies are not reliable per se; they rely heavily on the context of the secondary study.

In a more recent work, Hassler et al. presented a rank of barriers to the SLR process gathered from a community workshop (Hassler et al. 2014). Along with 37 composite obstacles to the SLR process, the authors also describe the impact of them on SLR methodology, researchers, authors, and consumers. Some of their findings share similarities with other previous studies, but new issues are also presented, such as the ones related to i) the presence of a sequential process for SLR instead of an iterative one; ii) the lack of support for interpretation and generalization of studies; iii) the misleading titles and abstracts; and iv) the lack of consistency of the SE terminology; among others.

The primary goal of this research is to characterize the reliability of SLRs by identifying similarities and differences in their processes and outcomes. Therefore, works such as (MacDonell et al. 2010; Kitchenham et al. 2011, 2012) and (Wohlin et al. 2013) are more closely related to the one presented in this paper. However, we decided not only to compare the included articles but also to compare search strings, inclusion/exclusion criteria, returned and excluded papers and also the outcome that was expected to answer the research question. Our expectation in applying this holistic view was to gather sources of comparison that would support us drawing a better conclusion on the points that make the SLR process more/less reliable. Also, we decided to provide instruments in our exploratory study to prevent the students from experiencing some of the difficulties previously mentioned. It allowed us to observe other challenges and pitfalls commonly faced by novices, as well as real problems with surveying evidence in the SE field that can be further used as a base to enhance the research on this topic.

3 The Exploratory Study Planning

Based on the previous discussions, this section presents the plan of our exploratory study on the reliability of SLR processes in the SE field. Detailed information on the materials and data collected during the study can be found in our study package available at http://lens-ese.cos.ufrj.br/appendices/EMSE/2016/StudyPackage.zip.

3.1 Goal

The research objective was set using the Goal-Question-Metric (GQM) template (Basili 1992), as Table 1 depicts.

Table 1 Research objective

Challenges and pitfalls on surveying evidence in the software engineering technical literature: an exploratory study with novices

Abstract

Similar content being viewed by others

On the pragmatic design of literature studies in software engineering: an experience-based guideline

Practical relevance of software engineering research: synthesizing the community’s voice

Benefitting from the Grey Literature in Software Engineering Research

Explore related subjects

1 Introduction

2 Related Works

2.1 SLRs and Novices

2.2 SLRs and Experts

3 The Exploratory Study Planning

3.1 Goal

3.2 Participants

3.3 Materials

3.4 Research Question and Assumptions on SLR Reliability

3.5 Tasks and Procedures

3.6 Analysis Procedure

3.6.1 Research Protocol Similarity Analysis

Syntactic Perspective

Semantic Perspective

3.6.2 Outcomes Similarity Analysis

4 Study Results

4.1 Same Research Question and Different Protocols: Syntactic Perspective Analysis

4.2 Same Research Question and Different Protocols: Semantic Perspective Analysis

4.3 Same Studies and Different Outcomes

4.4 Study Conclusion

5 quasi-SLR Research Protocols and Report Grades: Quality Assessment

5.1 Assessing the Protocols and Reports through a DARE Criteria Adaptation

5.2 Evaluating the Searches through Precision and Recall Analysis

6 Challenges and Pitfalls on SLR Planning and Execution in the SE Field

6.1 Lack of Experience in the Topic

6.2 Lack of Experience in the Method

6.3 Lack of Clearness and Completeness of the Papers

6.4 Lack of a Common SE Terminology

6.5 Lack of Verification Procedures

6.6 Lack of Commitment to the SLR

7 Threats to Validity

8 Conclusions

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Appendices

Appendix 1

Appendix 2

Appendix 3

Appendix 4

Appendix 5

Appendix 6

Appendix 7

Appendix 8

Appendix 9

Appendix 10

Appendix 11

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation