1 Introduction

Literature reviews serve as a common starting point for most scientific research, including research in the Software Engineering (SE) field. Finding and reviewing previous studies or software technologies helps researchers identify i) knowledge and new ideas about a topic; ii) research gaps and opportunities; and iii) related work. In industrial software scenarios, practitioners can take advantage of literature reviews to support the search for software methods, processes, techniques, and tools, among other instruments suitable for their development contexts, which lowers the risk of incorrect adoption decisions in their software development settings. However, ad-hoc literature reviews can threaten their own replication, coverage, and fairness, among other properties.

Systematic literature reviews (SLRs) represent a more procedural and rigorous strategy for performing literature reviews. They define a set of steps to guide the scientific literature search, producing a repeatable research protocol, allowing critical judgment about the quality of the obtained knowledge and reducing bias in the outcomes (Biolchini et al. 2005; Kitchenham and Charters 2007). Quasi-systematic literature reviews (Travassos et al. 2008) and systematic mapping studies (Petersen et al. 2008) are also types of SLRs. The former does not support meta-analysis due to the lack of a baseline (comparison) for evidence aggregation. The latter focuses on providing an overview of an area of interest rather than aggregating evidence for a specific purpose.

As an investigation tool, the SLR strategy plays a major role in the context of evidence-based software engineering (EBSE), which aims at providing an efficient way to integrate current scientific evidence with practical experience to support decision making in SE (Dybå et al. 2005). The SLR’s methodical processes for gathering, extracting, evaluating, and aggregating evidence from various studies can assist researchers in organizing a relevant and reliable body of knowledge regarding a specific research topic in academia. They can also assist practitioners in finding software technologies suitable for their particular software development scenarios in industry. As an example of the latter, Siemens Corporate Research supported the execution of an SLR investigating model-based software testing approaches (Dias Neto et al. 2007). Other examples of SLRs involving industry can be seen in (Kasoju et al. 2013; López et al. 2015; Ulziit et al. 2015) and (Garousi et al. 2016), among others.

The importance and expected benefits of SLRs justify the concerns regarding their quality, since the topic under investigation, the experience of researchers or practitioners in both the research method and the topic, and the means and knowledge supporting the answering of the research questions can compromise the results (MacDonell et al. 2010). Therefore, several guidelines for undertaking SLRs have been proposed over the years, such as (Biolchini et al. 2005; Kitchenham and Charters 2007; Petersen et al. 2015) and (Kuhrmann et al. 2017). Such guidelines provide recommendations aimed at reducing threats to the validity of SLRs by advising researchers and practitioners to explain the need for the SLR and to detail the research objectives and the plan that will support the study execution. Also, many investigations concerning the planning and execution of SLRs have been published in the technical literature. In this regard, some authors claim that SLRs are robust enough to resist execution deviations, producing stable outcomes for different processes (MacDonell et al. 2010). Notwithstanding, various researchers observed incompatibilities among the results of SLRs with similar goals but executed by independent investigators (Kitchenham et al. 2011, 2012; Wohlin et al. 2013) and (Munir et al. 2014) – more details in Section 2.

In this context, at an International Software Engineering Research Network (ISERN) meeting held in 2009, a group of ISERN members raised concerns regarding the possibility of conflicting results in SLRs. At that time, they were discussing the first SLR results in the SE field. They assumed that since an SLR protocol is supposed to be explicit, precise and unbiased, its outcomes should be equal or similar to the results obtained by other researchers or practitioners executing (replicating) it or working with SLR protocols with similar purposes. It was stressed that knowledge and experience in the method play a major role in SLR planning and execution and could lead to differences in the outcomes, indicating that an SLR might not be suitable for players inexperienced in the method.

Out of these discussions arises a question: if the technical literature reports inconsistencies regarding SLRs executed by novice and even expert researchers, and EBSE relies on research-based evidence produced through SLRs, how can SLRs conducted by practitioners – who are usually not well acquainted with this research method – be considered reliable? Thus, our aim is to discuss the reliability of SLRs based on the following statement: “Similar SLR protocols, executed by similar teams of novice researchers, lead to equivalent answers (outcomes) to the same research question.” It is important to note that, to some extent, graduate students (novice researchers) can present skills similar to those of practitioners, especially less experienced ones, concerning planning and executing SLRs. Regarding domain knowledge, practitioners may even be considered more experienced, but eventual differences between the SE terminology adopted in industry and academia can bring some difficulties to practitioners regarding SLR planning. That is, domain knowledge may be insufficient to figure out adequate terms associated with a specific research question. A set of investigation questions was posed aiming at observing the statement above: What will happen if balanced groups of novice researchers (regarding their knowledge and experience in SLR planning and execution, and also in the research topic) plan and execute an SLR for the same research question? Should the research protocols be similar to each other, given that they address the same research question? Once similar SLR protocols are planned, should the selection of studies and reported outcomes be equal to each other, given the repeatable characteristic of SLRs? What do the differences between the planned SLRs and their results tell us about reliability (process repeatability and outcome consistency)? How does the players’ (lack of) knowledge and experience affect the reliability of SLRs?

To investigate these questions, we planned and carried out an exploratory study (detailed in Section 3) to analyze the planning, execution and outcomes of seven quasi-SLRs conducted by novice researchers (master’s and doctoral students) in the context of an Experimental Software Engineering (ESE) course in two distinct years – 2010 and 2012. The results presented in Section 4 indicate that i) when the same research question is addressed, different quasi-SLR protocols are planned; ii) when a similar point of view for the studies’ selection strategies is reported, divergent studies are selected; and iii) when the selected studies are the same, independent teams report different results. These discrepancies reinforce the perception that the difficulties faced by novice researchers in the planning and execution of SLRs impact the reliability and repeatability of the approach. Based on that, we can question whether the guidelines proposed and used by academics to carry out SLRs are feasible for supporting novices in academia as well as practitioners in industry.

The remainder of this paper is organized as follows. In Section 5 we present the quasi-SLR scores concerning the research protocols and reports as a way to identify the main issues faced by the participants while performing the assignment. Next, in Section 6 we discuss the challenges of surveying SE evidence with novices and the strategies they can adopt to make SLRs suitable for those inexperienced in the method and in the topic under investigation, such as practitioners (especially concerning the former). The threats to the validity of this study are discussed in Section 7 and the conclusions are presented in Section 8.

2 Related Works

Several studies report on novice researchers performing SLRs in SE, and even though a couple of studies mention that novice researchers can undertake SLRs, such researchers represent one of the causes of results instability in SLRs. The definition of research questions, inclusion and exclusion criteria, and data extraction and synthesis are among the main difficulties faced by novice researchers while surveying evidence in the technical literature. However, difficulties in conducting systematic reviews can also be found when expert researchers conduct them. The next subsections provide an overview of different related works that i) used students to evaluate the applicability or reliability of SLRs; and ii) compared independently published literature reviews and used feedback from experts to assess the quality of the research method and also the barriers encountered during its execution. A summary of their results is highlighted since we used some of them to support the planning of the exploratory study presented in this paper.

2.1 SLRs and Novices

In 2006, Rainer, Hall, and Badoo presented a preliminary investigation of undergraduate students’ experiences of using the EBSE approach while evaluating software technologies (Rainer et al. 2006). Overall, students had problems constructing EBSE questions, and they mainly based their questions on topics they had some experience with, for instance, programming languages to be used in their undergraduate assignments. One of their main difficulties was formulating a question comparing software technologies. For example, the students formulated exploratory questions to identify all programming languages they could choose for their assignments, rather than developing a question to compare the programming languages they were in doubt about choosing. The sources selected for collecting information to support their answers were not as expected, since little scientific production was used to support their search for information. Also, the students provided poor explanations concerning their search process, and they made different use of the available guidelines.

Oates and Capper tried to overcome some of the issues observed by Rainer, Hall, and Badoo. They carried out what they called a case study to answer questions related to students’ use of the EBSE approach (Oates and Capper 2009). They asked students to conduct an SLR on a topic of their interest and write a short essay on their experiences with the EBSE approach. The authors imposed some restrictions, though: they gave the students a question to start working with, and they advised the students to search in scientific databases and to refine their search until they reached a set of 10 to 30 articles. The analysis of the students’ marks supported the authors’ assumption that students could perform SLRs – at least under the restrictions and guidance previously stated. The authors noticed that students need a more iterative surveying process in which they can refine their search strategy until they find works relevant to answering their research questions.

Even though Oates and Capper stated that students could perform SLRs, according to Riaz et al. their experience in conducting a complete systematic search for evidence can be quite different from that of experts (Riaz et al. 2010). In a study to gather the main difficulties faced by students while conducting SLRs, the authors identified issues related to building a search string that would retrieve a considerable number of papers without returning much noise; selecting appropriate works based solely on title and abstract; extracting the right amount of information from the selected works; and synthesizing data that was not easily comparable; among others. While defining the research question can be challenging to both novices and experts, overall the former group faced more difficulties than the latter.

Brereton identified positive results in a study involving students conducting SLRs (Brereton 2011). In her case study, she observed that students were successful in undertaking most of the steps of the SLR process. The students’ performance was assessed based on the marks for their activities, and, in general, students with lower marks had problems separating the planning information from the execution information. In summary, students succeeded more in planning activities, even though they mentioned that the planning phase was the most difficult part of the SLR process.

Although all these studies concluded that students can perform SLRs in SE, even though they have more difficulties than experts in performing the search, there are still some issues related to SLR completeness and repeatability that they did not evaluate. Kitchenham et al. (2011) presented a case study conducted to investigate the repeatability of the results provided by SLRs. Two research assistants (RAs) planned and conducted an SLR on the same topic, and their results were compared to each other. Their results were also compared with a previously published literature review on the same subject conducted by experienced researchers. Even though the same search period and libraries were used for all three SLRs, they reported different sets of primary studies for the same research topic. Kitchenham et al. conjectured that the lack of experience in the research topic and in the method, and the application of the inclusion/exclusion criteria, may be the reasons for these differences.

More recently, Carver et al. identified barriers to the SLR process (Carver et al. 2013). The authors gathered data from their own experiences conducting SLRs, as well as from the feedback of graduate students in an SLR course and from authors of published SLRs. Among the most difficult tasks of the process are those related to selecting papers, extracting data and assessing study quality, and the most time-consuming tasks are those related to searching databases, choosing papers and extracting data. The authors’ findings suggest the need for careful SLR planning, especially concerning scoping the research questions and defining the inclusion/exclusion criteria. Also, their study emphasizes the need for reviewing the whole plan (by experts) as well as taking advantage of teamwork to minimize bias and support conflict resolution.

2.2 SLRs and Experts

Issues involving the use of SLRs in SE are not exclusive to students’ participation. In 2009, Babar and Zhang performed an interview-based survey to identify the perceptions of research practitioners on conducting SLRs in the SE field (Babar and Zhang 2009). The authors selected 24 researchers identified as active practitioners in SLR executions, of whom 17 agreed to respond to their interview. Apart from the positive perceptions regarding the research method, the researchers reported some of the most challenging aspects of SLRs, which included the effort involved in the whole process, the design of search strings, and the definition of research questions.

More aligned with the work of assessing the reliability of SLRs, MacDonell et al. (2010) investigated the consistency of the SLR process and the stability of its outcomes. Their study compared the results of two independent reviews (performed by groups with similar domain experience) undertaken with a common research question. In comparison to the work presented by Kitchenham et al. (2011), the reviewers had vast experience in the research topic (cross-company and within-company estimation models). Although the two groups conducted the SLR in different ways (search strings, review process), the findings were similar (of the 11 primary studies identified by the two groups, nine were identified in common). The main causes of the differences were identified as: i) a lack of consensus on what constitutes a high-quality primary study, and; ii) misunderstandings as to what constitutes an appropriate response variable. The conclusion of the study indicates the robustness of the SLR as a research method (considering groups with similar domain experience), although its repeatability can be compromised.

In a participant-observer case study, Kitchenham et al. (2012) performed a mapping study of unit testing and regression testing to investigate the completeness of general mapping studies. They compared it with other, more specific mapping studies, SLRs and an expert literature review, aiming at investigating how well general mapping studies identify clusters of related studies and to what extent such clusters are complete. The authors identified differences between the general systematic mapping they performed and the expert literature review regarding the included papers, showing that their mapping study outperformed the expert review. They also found that, in comparison to SLRs and more accurate mapping studies, general mappings can miss important and relevant works. During the comparison, the authors identified issues related to differences in the classification of selected studies between the literature reviews, and also inconsistencies in the selection of studies, which led the authors to advise providing clear explanations for the exclusion of studies.

Another work that compared the results of independent literature reviews is the one presented by Wohlin et al. (2013). The authors present a study of two systematic mapping studies on the same research topic, aiming at evaluating their reliability. Although the two studies address the same research topic, significant differences were identified regarding the inclusion and categorization of papers, indicating low similarity between them. Based on that, the paper presents four conjectures to be confirmed or rejected through future investigations: i) snowballing based on researcher expertise and knowledge of an area is more efficient than trying to find optimal search strings; ii) secondary studies will not find the same papers unless it is a study of a relatively narrow area with experts in the area conducting the study; iii) secondary studies may come to the same general conclusions regarding an area even if the papers found are not the same, and; iv) secondary studies are not reliable per se; they rely heavily on the context of the secondary study.

In a more recent work, Hassler et al. presented a ranking of barriers to the SLR process gathered from a community workshop (Hassler et al. 2014). Along with 37 composite obstacles to the SLR process, the authors also describe their impact on the SLR methodology, researchers, authors, and consumers. Some of their findings share similarities with previous studies, but new issues are also presented, such as those related to i) the use of a sequential process for SLRs instead of an iterative one; ii) the lack of support for the interpretation and generalization of studies; iii) misleading titles and abstracts; and iv) the lack of consistency of the SE terminology; among others.

The primary goal of this research is to characterize the reliability of SLRs by identifying similarities and differences in their processes and outcomes. Therefore, works such as (MacDonell et al. 2010; Kitchenham et al. 2011, 2012) and (Wohlin et al. 2013) are the ones most closely related to the work presented in this paper. However, we decided not only to compare the included articles but also to compare search strings, inclusion/exclusion criteria, returned and excluded papers, and also the outcomes expected to answer the research question. Our expectation in applying this holistic view was to gather sources of comparison that would support us in drawing better conclusions about the points that make the SLR process more or less reliable. Also, we decided to provide instruments in our exploratory study to prevent the students from experiencing some of the difficulties previously mentioned. It allowed us to observe other challenges and pitfalls commonly faced by novices, as well as real problems with surveying evidence in the SE field that can be further used as a basis to enhance research on this topic.

3 The Exploratory Study Planning

Based on the previous discussions, this section presents the plan of our exploratory study on the reliability of SLR processes in the SE field. Detailed information on the materials and data collected during the study can be found in our study package available at http://lens-ese.cos.ufrj.br/appendices/EMSE/2016/StudyPackage.zip.

3.1 Goal

The research objective was set using the Goal-Question-Metric (GQM) template (Basili 1992), as Table 1 depicts.

Table 1 Research objective

We intend to investigate the SLR process repeatability and outcome consistency based on the similarities and differences encountered in the research protocols and reports of seven SLRs dealing with the same research question and performed by similar teams of novice researchers (concerning mainly their inexperience in the research method).

3.2 Participants

The studies were executed during two years (2010 and 2012) in the ESE course at COPPE/UFRJ. The participants were graduate students (seven D.Sc. and 14 M.Sc.) in their first year of graduate studies (taking only courses in this period) in the Systems Engineering and Computer Science Program, and none of them had previous experience with the experimental topics taught in the course (Primary and Secondary Studies in SE), as can be seen in Fig. 1. The secondary study planning and execution were assignments given to the students – the main assignments used to grade the students in the module. We can highlight two main motivations for the students to participate in the study and be committed to it: i) first, many master’s and doctoral students expected to execute an SLR in the context of their own research (dissertations and theses), which became true in many cases (see Table 2); ii) second, since the students were being marked on the assignment, they had to apply themselves in order not to fail the module. Otherwise, it could cost them their standing in the graduate program.

Fig. 1 Participants’ experience in the main topics related to the assignment

Table 2 Teams’ names and characteristics

We organized seven teams (three in 2010 and four in 2012) with three participants each, aiming at reducing communication gaps and problems with course commitment. Members of the Experimental Software Engineering Group at COPPE/UFRJ attending the course were grouped together, and part-time participants were either placed in the same group or scattered consistently among the teams. Other characteristics, such as the perceived knowledge (observed during the classes) and declared knowledge (responses to a characterization form) of the topic under investigation – use cases – and academic experience, were also used as drivers to organize the teams. For instance, no team was composed only of doctoral students, and all teams had participants with expertise in SE practice (practitioners). It is important to emphasize that all participants had previous knowledge and, in some cases, expertise in use case descriptions, either from academia or industry. Table 2 summarizes some of the characteristics of each team, including whether the participants conducted an SLR in the context of their own research (see our study package for more information at http://lens-ese.cos.ufrj.br/appendices/EMSE/2016/StudyPackage.zip).

We are aware that ensuring that two teams of researchers have similar knowledge and expertise (in the research method and topic) is laborious and subjective. Furthermore, the characteristics used to assess these features might not be sufficient to guarantee such an assumption. Therefore, the participants were grouped so as to guarantee, as much as possible, similarities among the teams and to ensure a real commitment during the SLR planning and execution. The constraint of having at least one participant with experience in the software industry in each team was also a way to simulate a scenario in which a practitioner would perform an SLR.

3.3 Materials

To support this exploratory study, we prepared and used three materials: i) a consent form (written in Portuguese); ii) a characterization form (written in Portuguese), and iii) an initial research protocol (written in English). The students were not obligated to participate in the study; for this reason, all of them received the consent form and were asked to sign it if they agreed to take part in the study. They knew they would be graded on the assignment and that an alternative form of evaluation would be given to those who did not consent to participate. All of them signed the consent form.

After agreeing to the study, the participants filled in a characterization form, self-reporting their knowledge and experience (using a Likert scale) in the following topics: English reading and comprehension, software development, requirements and use cases, primary and secondary studies, and quality appraisal of software artifacts and scientific papers. The stratification of the participants into different teams used this particular form.

The initial research protocol is the most important instrument of this study since it contains the main elements to guide the students in performing the SLR on the same topic. It contains the research question the teams should answer: “Which quality attributes (and measurements used to evaluate such attributes) have been empirically studied for use cases?” The topic of use cases was suggested at the 2009 ISERN meeting, and it was used in our study because it is believed to be a well-grounded topic in the SE field in which the participants would have more knowledge and experience. Also, the participants would feel more comfortable working with it – which was the case in our study according to the characterization form responses.

Along with the research question, the following information was also provided in the initial research protocol:

  (i) background information and perspectives of quality attributes regarding requirements specification, as presented in (Condori-Fernandez et al. 2009);

  (ii) a request to extract from selected studies the approaches, templates or formats proposed to improve the use case quality;

  (iii) some initial terms to support the search for studies;

  (iv) definition of the search engines to be employed in the study – Scopus, Web of Science and IEEE Xplore;

  (v) some initial criteria for the studies selection and evaluation, and;

  (vi) a data extraction form suggestion.

The idea behind providing all this information to the students was to place them in the same perspective concerning the quality of use cases and also to prevent the main problems reported and highlighted in the related works. It is important to notice that, despite this information being made available, the teams still had to complete the SLR protocol and had the freedom to change some items, except for the research question.

3.4 Research Question and Assumptions on SLR Reliability

Driven by our main research question – do similar SLR protocols, executed by similar teams of novice researchers, lead to similar answers to the same research question? –, some behaviors concerning the SLR planning and outcome can be conjectured, as presented in Table 3.

Table 3 Protocol and outcome similarities alternatives

Since SLRs provide a well-defined procedure to identify, analyze and interpret, impartially and repeatably, all kinds of available evidence related to a specific research question (Biolchini et al. 2005), two of the behaviors presented in Table 3 are naturally expected to happen, especially considering the similarity of the researchers’ knowledge and experience in executing the reviews. Yet, if two SLR protocols are alike, and their execution (in terms of studies selection) and/or outcomes (in terms of answers to the research question) turn out to be different, it might show that either some external factors influenced the selection of studies and the analysis of the results (e.g., the existence of ambiguous information and various terminologies in the studies) or relevant information is missing from the research protocols. These issues hamper the SLR process repeatability and, thus, its reliability, as might have been the case in the reports of (Kitchenham et al. 2011, 2012; Wohlin et al. 2013) and (Munir et al. 2014).

Conversely, if two SLR protocols are different and their outcomes turn out to be similar, this can reveal the existence of a similar terminology used to report the results and/or a similar researchers’ point of view about the topic under investigation. The points of view can be expressed not exactly by the terms of the search strings and adopted selection criteria, but by the intention behind the search and selection of studies expressed in these two elements (showing that some parts of the research protocol are particularly more important than others). This might have been the case reported in (MacDonell et al. 2010), which concluded that SLRs are reliable since they can result in similar answers to the same research question even in the face of differences in their investigation processes.

3.5 Tasks and Procedures

All participants received equivalent lectures on SLRs and had about two months to execute the assignment and present the quasi-SLR results. The lectures involved topics related to primary and secondary studies in SE, and a secondary study planning and execution was one of the assignments given to the students – the main assignment used to grade the students in the module.

We asked the students to use the available guidelines for secondary study execution to guide them in the planning and execution of their studies, and at any time they could ask questions about the SLR steps. Although the initial research protocol was provided, the participants were free to complete it according to their understanding of the topic under investigation, as long as they did not modify the research question or the defined search engines (so as not to lose the baseline for comparison). To support the protocol refinement, especially the search string formulation using the Population-Intervention-Comparison-Outcome (PICO) strategy (Pai et al. 2004), we also advised them to identify control articles and to improve their search strings based on them. As additional requests for the assignment, the students should use JabRef to support the studies selection and data extraction, and should provide three main deliverables: i) the updated quasi-SLR plan; ii) the BibTeX file featuring the studies selection and data extraction; and iii) the complete quasi-SLR package, which should include the final version of the research protocol, the BibTeX file, the included/excluded papers and reports with the quality attributes for use cases extracted from the included articles.
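To make the PICO-based string construction more concrete, the sketch below shows one way a team could assemble a boolean search string from groups of synonym terms. The term lists and the helper function are illustrative assumptions for the given research question, not the strings actually produced by any team; the Comparison facet is omitted since the question is not comparative.

```python
# Illustrative sketch only: building a PICO-style boolean search string from
# synonym groups. The groups below are hypothetical examples, not the actual
# terms used by any of the seven teams.

def or_group(terms):
    """Join synonym terms with OR, quoting compound terms."""
    return "(" + " OR ".join(f'"{t}"' if " " in t else t for t in terms) + ")"

population = ["use case", "use cases"]                          # P: what is studied
intervention = ["quality attribute", "quality characteristic"]  # I: what is evaluated
outcome = ["empirical study", "case study", "experiment"]       # O: expected evidence

search_string = " AND ".join(or_group(g) for g in (population, intervention, outcome))
print(search_string)
# ("use case" OR "use cases") AND ("quality attribute" OR "quality characteristic")
#   AND ("empirical study" OR "case study" OR experiment)
```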

3.6 Analysis Procedure

3.6.1 Research Protocol Similarity Analysis

We defined two similarity perspectives to analyze the agreement among the SLR plans: syntactic and semantic. In the syntactic perspective, we want to observe the similarity between pairs of reviews regarding their searches in the digital libraries. In the semantic perspective, we want to observe the similarity between pairs of reviews regarding the participants’ points of view about the research question. The following subsections present details about these two similarity perspectives.

Syntactic Perspective

From this point of view, we want to observe the exact match between pairs of reviews in finding the same papers. To do so, we selected the Jaccard index – Eq. (1) (Jaccard 1912) – as a measure of the similarity of two protocols (A and B) for two units of analysis: the adopted search terms and the papers returned in common, as described below:

$$ J\left(A,B\right)=\frac{\left|A\cap B\right|}{\left|A\cup B\right|}=\frac{\left|A\cap B\right|}{\left|A\right|+\left|B\right|-\left|A\cap B\right|} $$
(1)

With respect to the terms adopted in the search strings of two protocols (A and B), the Jaccard index expresses the portion of terms they have in common (|A∩B|) in relation to the total number of terms used in protocol A (|A|) plus the total number of terms used in protocol B (|B|), excluding the common ones. To compare the similarity of two terms we applied some rules: i) we used the main search string (see Appendix 2) created by each team instead of the three search strings tailored to each search engine; ii) we did not consider the use of quotation marks, that is, a term with or without them would be equivalent; iii) the terms were considered case insensitive; iv) we considered singular and plural forms (just the ones that add ‘s’ at the end of a word) of a search term as equivalent; and v) we considered the use of hyphens, that is, a search term with a hyphen was regarded as different from its counterpart without the hyphen. Since these rules simplify the comparison of search terms, and the logic of the search strings was not considered in the described similarity calculation, it is wise to analyze the similarity of the adopted search terms along with the similarity of the returned papers, decreasing the threats to the validity of the syntactic perspective analysis.
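A minimal sketch of this term-level comparison is given below, assuming a crude canonical form that applies rules (ii)–(iv); the two term sets are hypothetical and only illustrate how Eq. (1) is applied.

```python
# Minimal sketch of the syntactic term comparison (rules ii-iv above).
# Folding a trailing 's' is a crude canonical form that treats singular and
# plural spellings as equivalent. The term sets below are hypothetical examples.

def normalize(term: str) -> str:
    term = term.replace('"', '').strip().lower()  # rules (ii) and (iii)
    if term.endswith('s'):                        # rule (iv): simple plural folding
        term = term[:-1]
    return term                                   # rule (v): hyphens left unchanged

def jaccard(a: set, b: set) -> float:
    """Eq. (1): |A intersect B| / |A union B|."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

team_a = {normalize(t) for t in ['"Use Cases"', 'quality attribute', 'readability']}
team_b = {normalize(t) for t in ['use case', '"quality attributes"', 'correctness']}
print(f"{jaccard(team_a, team_b):.2%}")  # 50.00% for these hypothetical sets
```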

Regarding the papers returned in common, the aim is to identify the portion of papers returned by both search strings (|A∩B|) in relation to all the papers returned by the pair A (|A|) and B (|B|) (not counting duplicates – |A∩B|). As we compare SLRs executed in different years, only papers published up to 2010 were considered in the comparison of pairs of protocols from distinct years.

The analysis of the distribution of values for each unit of analysis supports the identification of slight similarities (below the first quartile of the distribution) and almost perfect similarities (above the third quartile of the distribution). A complete similarity is represented by 1.0.

Semantic Perspective

From this point of view, we want to observe the similarities among the teams’ intentions in searching for and accepting the same papers, that is, their points of view about the research question. The units of analysis, in this case, are the main concepts embedded in the search string terms; the paper inclusion and exclusion criteria; and the included and excluded papers. Similarly to the syntactic perspective analysis, the semantic perspective analysis should take into consideration all the mentioned units of analysis to support more reliable conclusions regarding the teams’ similar/different intentions in searching for and accepting the same papers.

To identify the main concepts (Appendix 3) embedded in the search strings, we needed to abstract them from the terms used in each research protocol by applying a coding technique similar to the open coding provided by Grounded Theory (Corbin and Strauss 2007). One of the authors assembled and sorted alphabetically all the terms used in the seven search strings (Appendix 2), ignoring the logical structure of the search strings and aggregating terms with the same name (by applying the rules mentioned in the previous subsection). Next, during a three-hour session, the three authors got together to identify the main concept of each term. For each of the 366 different search terms, the authors assessed its meaning based on the semantics of its words in the SE field and assigned a concept to it. Whenever a new concept was identified, we compared it to the existing concepts, avoiding the creation of different concepts with the same meaning. Overall, 23 different concepts were identified, and the same Jaccard index – Eq. (1) – could be used to measure the semantic similarity among teams in terms of the concepts abstracted from their adopted search terms.

Concerning the paper inclusion and exclusion criteria, the semantic similarity can also be measured using the Jaccard index by checking the proportion of inclusion and exclusion criteria each pair of research protocols shares. The last unit of analysis (included and excluded papers) is the one that relates the most to the teams’ points of view about the research question, and it can be observed in the light of the teams’ agreements and disagreements in including/excluding papers for data extraction. The Kappa coefficient – Eq. (2) (Cohen 1960) – can support the measurement of this feature, since it is used to measure the agreement in qualitative evaluations among different raters. In subjective interpretations, two observers will sometimes agree or disagree simply by chance, since no objective criterion is stated (Viera and Garrett 2005). The Kappa coefficient calculates the qualitative agreement among raters discounting the probability that the agreement happened by chance. To do so, it takes into account the relative agreement of the raters (two teams in our case) in each of the analyzed categories (included and excluded papers in our case – qualitative perspective) – po – and the probability that the agreement happened by chance – pe, subtracting pe from po, as follows:

$$ K=\frac{p_o-p_e}{1-p_e}=1-\frac{1-p_o}{1-p_e} $$
(2)

Although we could have used the Jaccard index presented previously to characterize the agreement on the inclusion and exclusion of papers, the Kappa coefficient is more robust for measuring agreement in qualitative evaluations, since it discounts agreement by chance – detailed information in (Viera and Garrett 2005). The consideration of papers published only up to 2010 for the comparison of pairs of protocols from different years is also required in this case.
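As a minimal illustration of Eq. (2), the sketch below computes the Kappa coefficient for two teams’ include/exclude decisions over the same candidate papers; the decision lists are hypothetical and serve only to show how po and pe are obtained.

```python
# Minimal sketch of Eq. (2): Cohen's Kappa for two teams deciding whether to
# include or exclude the same candidate papers. The decision lists below are
# hypothetical; in the study, only papers returned by both searches are compared.

def cohen_kappa(ratings_a, ratings_b):
    assert len(ratings_a) == len(ratings_b) and ratings_a
    n = len(ratings_a)
    categories = set(ratings_a) | set(ratings_b)
    # observed relative agreement (po)
    p_o = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # chance agreement (pe) from each rater's marginal proportions per category
    p_e = sum((ratings_a.count(c) / n) * (ratings_b.count(c) / n) for c in categories)
    return 1.0 if p_e == 1 else (p_o - p_e) / (1 - p_e)

team_a = ['include', 'exclude', 'exclude', 'include', 'exclude']
team_b = ['include', 'exclude', 'include', 'exclude', 'exclude']
print(f"kappa = {cohen_kappa(team_a, team_b):.2f}")  # kappa = 0.17
```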

We used the work by Viera and Garrett to identify the level of agreement for the obtained Kappa values, since it is commonly applied for this purpose (Viera and Garrett 2005). The confidence interval used was 95%. A perfect agreement is represented by 1.0.

In the end, two quasi-SLR protocols are considered similar if their syntactic and semantic perspectives show almost perfect similarity and agreement. Slight similarity and agreement indicate that the protocols are quite different.

3.6.2 Outcomes Similarity Analysis

We can analyze the answers to the research question provided by the teams – quality attributes for use cases – from two perspectives: either the answers are correct, complete and consistent among the seven SLRs, or they are incorrect, incomplete and inconsistent and need to be revised and detailed so that we can identify their actual match among the reviews and thus perform the similarity analysis. In this regard, we decided to consider all answers as correct, complete and consistent. Otherwise, we would have had to go through all the included papers from the seven SLRs and revise the teams’ data extractions, which would result in a comparison of the authors’ answers, not the teams’ ones. Thus, we compared the quality attributes according to their syntax only, assuming that if the syntax is equivalent, so is the meaning. We understand that this decision can make us overlook some particularities of the answers that would prevent us from matching same-name attributes with different meanings reported in the various reviews, or would even make us match different-name attributes with similar meanings. However, we made this decision to avoid biasing the quasi-SLR outcomes and the teams’ perspectives. The similarity of two answers (sets of quality attributes) is then calculated using the Jaccard index, as we did for the search terms, returned papers, main concepts, and inclusion/exclusion criteria.

It is important to stress that even though the strategy used for extracting the main concepts from the search terms could also be applied to the quality attributes for use cases, it would not result in as diversified a group of concepts as before, since the scope in this case is narrower than that of the adopted search terms.

4 Study Results

In this section, we report the similarities and agreements observed among the seven quasi-SLR research protocols and outcomes, highlighting some pitfalls (underlined throughout the section) that help explain the observed divergent results. Figure 2 presents an overview of the teams’ quantitative results divided by the protocol and outcome similarity analysis perspectives – which will guide the report of this study’s results. As one can see, the selected search terms ranged from 11 terms used by the Black team to 215 terms used by the Purple team, while the returned papers ranged from 157 papers returned by the Pink team’s search to 661 by the Black team’s. These different outcomes foretell the findings presented in this section.

Fig. 2 Summary of teams’ quantitative results

4.1 Same Research Question and Different Protocols: Syntactic Perspective Analysis

Since the same research question was addressed (with minor differences, as can be seen in Appendix 1) and an initial protocol was given to all the teams to ground their knowledge in the research topic and method, we expected some similarity among the research protocols. Surprisingly, we could not observe this behavior. Table 4 presents the Jaccard index (expressed as a percentage) for the terms used in each pair of search strings.

Table 4 Syntactic Perspective - Jaccard index (in %) for terms used in common between each SLR pair

Although the pairs Red-Green and Pink-Yellow present the highest similarity indexes for terms in the search string, it is rather naive to assume any similarity given values only slightly above 18% for their common terms. In the Pink-Yellow pair, the teams’ characteristics might help to explain this proximity when compared to the other teams. Both the Pink and Yellow teams were mainly composed of ESE group members. In this case, almost a third (six out of 19) of the similarities lie in the terms they usually use to search for empirical studies in SE – terms with which they were more familiar than the other participants due to the daily research activities in their research group.

The Blue team diverged noticeably from almost all other teams (Fig. 3). A detailed analysis of the terms used in its search string (Appendix 2) reveals that the participants preferred general over specific terms (keywords too high-level). For instance, instead of “quality attributes” and “quality characteristics” chosen by other teams, the Blue team decided to use “attribute,” “characteristic” and “quality” in its search. Although its query would also return the papers that present “quality attributes” or “quality characteristics,” we did not consider its terms as equal to the others during the comparison of the terms, since the other teams would not find the same papers returned by the Blue team’s search. An interesting observation on the identified search terms is that all teams but the Red chose to use words with no impact on the searches (unnecessary search terms); that is, they included plural terms and their singular versions, or compound terms that were already covered by simpler terms previously identified. As examples of these cases, some teams used “use case” and “use cases” in the same search string, or even “description template” and “template,” among other examples.

Fig. 3 Summary of the number of terms in common among the teams

Overall, the teams identified 366 distinct terms: no term was mentioned in all seven search strings, three (“software development”, “consistency” and “understandability”) were used by six teams, eight (“system development”, “use case”, “quality characteristic”, “quality factor”, “quality feature”, “completeness”, “correctness”, and “efficiency”) by five teams, and three (“case study”, “software project”, and “quality attribute”) by four teams. The remaining terms were used by less than half of the teams. From the search strings (see Appendix 3) we could notice that many teams tried to maximize the number of term combinations, disregarding whether they were valid search terms (many different combinations of terms producing noisy returns). Also, some terms had no relation to the research question, such as “testwarehouse”, “program method”, and “degree of functional encapsulation”, not to mention other terms not typical for searches, such as “desirable quality”, “mistake free”, and “wholeness” (inappropriate selection of search terms), stressing the difficulties in creating a search string that meets a research purpose.

As explained in Section 3.6.1, the analysis of similarities through the syntactic perspective should be done considering not only the terms used for the search but also the returned papers to take into account the logical expression of the search strings. The syntactic similarities of the returned papers were even lower than the similarity registered for the terms (Table 5).

Table 5 Syntactic Perspective - Jaccard index (in %) for papers returned in common between each SLR pair

Aside from the differences in the logic of the search strings (overuse of ‘and’ operators and distributive properties), common terms previously identified by the teams were organized differently in the search strings, even though all the participants were instructed to use the PICO strategy (Pai et al. 2004) (misuse of the guidelines). Another observed detail regards how differently the teams configured the search engines, bounding the subject areas that should be excluded from the search (see Appendix 4). These differences have certainly affected the results provided by the search engines, explaining the lower percentages in Table 5.

As we can observe, no noticeable similarity can be seen from the syntactic perspective even though the same research question and an initial research protocol were given to the teams.

4.2 Same Research Question and Different Protocols: Semantic Perspective Analysis

The semantic perspective can help us to understand whether the teams’ point of views have influenced the differences in the syntactic perspective and whether similar findings can be observed in the existing, though low, similarity. Out of the 366 distinct terms used in all seven quasi-SLR search strings, 23 main concepts were identified through the coding process: defect rate; evaluation; environment; general quality issue; product; project; quality features; requirement documents; requirement models; requirement representation policy; rework rate; scenario; scenario documents; scenario models; software life cycle; software technology; use case; use case concepts; use case documents; use case models; use case representation policy; user story documents; and user story models.

As examples of the coding process, terms such as “defect fee,” “defect rate,” “defect ratio,” “error rate,” “fault rate,” and “mistake rate” were grouped in the main concept “defect rate.” More specific terms such as “ambiguity,” “clarity,” “completeness,” “comprehensibility,” “concise,” “correctness,” “readability,” “traceable,” “understandability,” “usability” were grouped in the main concept “quality features.” To group terms such as “application method,” “development approach” and “software technique” we used “software technology.” Appendix 3 presents the complete list of concepts, their meaning and the respective terms that generated them.

Not all research protocols reported terms related to every main concept, although the similarity of the concepts was much higher than that of the terms (Table 6). Out of the 23 concepts, two (“general quality issue” and “software life cycle”) are present in all seven strings. However, eight (“environment,” “rework rate,” “scenario documents,” “scenario models,” “use case concepts,” “use case representation policy,” “user story documents” and “user story models”) are present only in either the Red team’s string (“use case concepts”) or the Purple team’s string (the other seven concepts). This made the Purple team, along with the Yellow team, present the worst similarity indexes for the concepts, as depicted in Table 6.

Table 6 Semantic Perspective - Jaccard index (in %) for main concepts used in common between each SLR pair

Given these better results regarding the similarities of the main concepts, we expected that the papers returned in common would be evaluated using a similar perspective, showing that even though the different searches did not retrieve the same papers, the teams had the intention to do so. While the analysis of the inclusion and exclusion criteria similarity leads to better results in comparison to the previous similarity analyses (Table 7), a closer look at the actual matches among the teams highlights that they barely changed the inclusion and exclusion criteria given in the initial protocol (Appendices 6 and 7). Furthermore, they diverged on the papers they should include or exclude in many cases. As an example, an initial comparison among all seven quasi-SLRs revealed that from 2167 articles (up to 2010) only two papers were returned in common. One of these two articles (Losavio et al. 2004) was unanimously excluded because it does not relate to use case quality attributes, but to software architecture design. The other paper (Ramos et al. 2009), though, led to different decisions among the teams: four teams included the paper for data extraction (Black, Red, Pink, and Green), while three excluded it (Purple, Blue, and Yellow). Analyzing the paper, we could observe that the empirical study it presents might have caused the divergence among the decisions, since its authors labeled the study as a case study, but the study description indicates it to be a proof of concept (misunderstandings concerning empirical/experimental study strategies).

Table 7 Semantic Perspective - Jaccard index (in %) for inclusion/exclusion criteria used in common between each SLR pair

According to the research question, the quality attributes for use cases should be empirically studied, to avoid reporting speculative attributes. Some teams (Pink, Purple, and Yellow) explicitly excluded papers (exclusion criteria – Appendix 7) that did not present any empirical study, or that presented either a toy example or a proof of concept, considering that they would provide unreliable quality attributes. Some interesting issues regarding this were observed while analyzing the teams’ reports and BibTeX files (available in the study package). They show different expectations concerning the empirical aspect of the quality attributes (no explanation concerning the empirical focus used):

  (i) The teams Black, Red, Pink, and Green included (Ramos et al. 2009) for evaluation, considering it was a case study (as labeled by the paper’s authors);

  (ii) The Purple and Blue teams at first included (Ramos et al. 2009), but afterward excluded it. No explanation for the exclusion was provided (tacit knowledge regarding the study selection strategy);

  (iii) The Yellow team excluded (Ramos et al. 2009), considering that no study was described regarding quality attributes for use cases.

Table 8 presents the Kappa agreement on including and excluding papers in each pair of SLRs.

Table 8 Semantic Perspective - Kappa coefficient (in %) for papers included and excluded between each SLR pair

The negative values indicate that the probability of the teams agreeing by chance while including and excluding papers is higher than their relative agreement. In the two particular cases in Table 8, neither the pair Purple-Green nor the pair Blue-Green had any paper included in common. Figure 4 shows that most of the agreement in the pair Blue-Green lay in the papers they excluded in common (22). No paper (zero) was included in common by the two teams, although both included papers from among those they found in common: one paper was included only by the Blue team and four papers were included only by the Green team (see intersection).

Fig. 4 Number of papers returned, included (underlined), and excluded by each team in the pair Blue-Green

Analyzing specifically the included papers (underlined) in the intersection of the pair Blue-Green (Fig. 4), we can observe that the Green team included papers not completely related to the research question (misinterpretation of the research question). Two of the included papers – (Preiss et al. 2001) and (Rago et al. 2013) – were not about use case quality attributes, but about quality characteristics expected of a software product according to its requirements specifications (controversial understanding of the research topic). The first paper intends to use these features as the basis for software development, while the second one intends to extract them from the specifications through mining. The only paper included by the Blue team in the intersection – (Fantechi et al. 2002) – was indeed related to use case quality attributes. The teams did not have the same perspective on the research question, as reinforced by the negative coefficient presented in Table 8; nor did they have similar inclusion and exclusion criteria, although their similarity in Table 7 says the contrary (tacit knowledge regarding the study selection strategy).

Overall, the teams had a higher agreement regarding the semantic perspective when compared to the syntactic perspective. However, we did not find any reasonable explanation for this behavior, because they barely changed the inclusion and exclusion criteria given in the initial protocol, and no other information could be obtained from their research protocols to support the understanding of such results. Table 9 summarizes the information concerning the returned and included papers in common per pair of teams, providing an overview of the findings and corroborating the previous results.

Table 9 Returned and included papers in common per pair of teams

4.3 Same Studies and Different Outcomes

Each one of the seven teams elaborated a list of quality attributes for use cases. In total, 83 distinct quality attributes (complete list in Appendix 11) were extracted from the seven quasi-SLRs. Of this total, 29 (~30%) were present in at least two lists, and just five quality attributes (consistency, correctness, completeness, readability, and understandability) were present in all seven lists. In addition to the overall analysis involving the seven quasi-SLRs, we also compared their lists in pairs. Table 10 summarizes the percentage of quality attributes found in common between each pair. It is important to observe that the comparison between the teams Black, Red, and Pink, and the teams Purple, Blue, Green, and Yellow was accomplished considering quality attributes found exclusively in the papers published up to 2010.

Table 10 Jaccard index (in %) for quality attributes found in common between each SLR pair

The percentages presented in Table 10 indicate a low level of agreement regarding the quality attributes for use cases, since no pair of teams identified at least 50% of the quality attributes in common. This fact can be partially explained by the low level of agreement regarding the papers selected by each team; that is, in most cases, the SLR teams analyzed different papers.

It is possible to accomplish an additional analysis: what is the level of similarity considering only the papers included in common by two teams? Therefore, the final report of each team was analyzed to extract: i) the papers included in common by two teams, and; ii) the quality attributes identified in each one of these common papers. Regarding item (ii), unfortunately, the Red and Pink teams did not report the quality attributes per paper (imprecise reports), which made it impossible to compare the findings of these two teams with the others. Thus, the comparison was accomplished between each pair composed of the teams Black, Purple, Blue, Green, and Yellow. The terms were compared by their syntax, not exactly by their meanings, since not all the reports presented complete and detailed information on the attributes gathered from the selected studies (incomplete reports).

Table 11 summarizes the number of papers included in common, the total number of quality attributes for use cases identified in these papers, and the number and percentage of the quality attributes identified in common by each pair of teams.

Table 11 Quality attributes in common per pair of teams

Again, the percentages highlighted in Table 11 indicate a low level of agreement between each pair of teams regarding quality attributes for use cases, even when these teams analyzed the same set of papers. In the worst case, the Green team had no included paper in common with Purple and Blue and just two quality attributes in common with Black and Yellow. A thorough investigation of the quality attributes extracted by Green revealed that most of these attributes are not related to use cases, such as accessibility, complexity of source code, safety, pluggability, portability and support for parallel development, among others (report of information not related to the research topic). The Green team also presented some great divergences from the other teams during the semantic analysis (Table 8 and Fig. 4), and these last results acknowledge their difficulties in interpreting the research question and the research protocol itself (misinterpretation of the research question). Interestingly enough, Green did not seem to have problems with the search string elaboration (next section), unlike other groups.

It was also observed that a specific paper reported the “7C’s of communicability” (Phalp et al. 2007) as a group of attributes related to use case quality (coverage, cogent, coherent, consistent abstraction, consistent structure, consistent grammar, and consideration of alternatives). The teams that selected this paper extracted and reported the seven attributes individually, but the Blue team extracted just one attribute (communicability) from the same paper, that is, communicability was used as a surrogate for the seven attributes. We are aware that there is no rule concerning the way the studies should be synthesized. However, explanations regarding the perspectives used for data extraction and synthesis are necessary to allow the study to be understood and replicated (subjectivity of the research synthesis strategy).

Moreover, it is also possible to observe that the Black and Purple teams had a high degree of convergence regarding the quality attributes for use cases when they analyzed the same set of papers. However, the data collected from their protocols do not allow us to conjecture why this convergence emerged, since the levels of syntactic and semantic agreement between these teams are low and their sets of selected papers are quite divergent.

4.4 Study Conclusion

The results presented in sections 4.1 to 4.3 show that the same research question led to different protocols, considering the syntactic and semantic perspectives, as well as to different outcomes. These results indicate that similar groups of novice researchers can elaborate distinct protocols and, consequently, obtain different outcomes when trying to answer the same research question. As this conclusion affects the reliability of SLRs conducted by novices, some of the pitfalls presented in this section can also be experienced by practitioners who have never undertaken an SLR before.

As mentioned previously, we understand that practitioners can be seen as more experienced than novice researchers concerning SE topics, which might prevent them from falling into some of the mentioned pitfalls. However, the differences between the SE terminology adopted in industry and in academia can bring difficulties to practitioners even in this regard, which leads us to discuss the challenges of surveying evidence in SE as a way of making this research tool more feasible and reliable for both researchers and practitioners.

Before discussing the challenges of SLR planning and execution in SE, the next section presents the quality assessment we performed on the seven quasi-SLR protocols and reports to identify additional issues that might have led the novices to the divergent results presented herein.

5 quasi-SLR Research Protocols and Report Grades: Quality Assessment

The differing results led us to question the quality of the SLR search protocols and reports. Therefore, we decided to assess them using two different strategies: i) reviewing each research protocol and report based on a set of criteria adapted from the Database of Abstracts of Reviews of Effects (DARE) criteria (NHS Centre For Review And Dissemination 2002), and ii) calculating the precision and recall (Diest et al. 2009) of each search string.

5.1 Assessing the Protocols and Reports through a DARE Criteria Adaptation

We adapted the set of criteria from DARE – which is used to evaluate SLRs in the medical field – to support the evaluation of the teams’ assignments. DARE provides five questions related to the existence and/or the quality of inclusion/exclusion criteria, search for evidence, selected results assessment, selected results details, and selected results synthesis. Taking these as inspiration, we created a scoring scheme (ranging from 0 to 10) suited to the context of the given assignment, which consisted of: i) checking whether the teams kept the initial protocol unchanged (5), changed it for the better (10), or changed it for the worse (0), in the case of the planning; ii) checking whether they applied their planning; and iii) checking whether their final report has a reasonable level of detail. Assessing the completeness of their planning and report would require an oracle protocol and an oracle report, which we did not have, and for this reason we decided not to use this perspective in the assessment. Table 12 presents the criteria used for the protocols assessment along with possible judgments concerning each criterion and their respective scores for the assignment.

Table 12 DARE criteria adapted to assessing the research protocols

The three authors assessed the seven research protocols individually according to the criteria above, assigning a score to each protocol and taking notes on issues whenever necessary. Each team’s final score was given by the average of the three evaluations. Table 13 presents the final score of each team concerning its research protocol, and it also includes the main pitfalls identified during the assessment.

Table 13 Research protocols scores

Along with the evaluation of the research protocols, we evaluated the reports. Differently from the protocols, we had no baseline for comparing the reports, so we based our assessment on the amount of useful and understandable information each report provided. Table 14 presents the criteria used for the reports’ assessment along with possible judgments concerning each criterion and their respective scores for the assignment.

Table 14 DARE criteria adapted to assessing the reports

Similarly, the authors assessed each report individually and assigned their scores and comments based on the criteria above. The same calculation used for the protocols’ final scores was performed for the reports. Table 15 details the assessment results.

Table 15 Report scores

As one can see from the protocol and report scores, there is a tendency for low-scoring protocols to lead to low-scoring reports and, likewise, for high-scoring protocols to result in high-scoring reports. This type of assessment does not consider the correctness of the results, because that would require an oracle protocol and report for comparison. Thus, it evaluates whether the relevant information is presented and whether it is detailed enough for further analysis and/or aggregation. We did make some adaptations to DARE to fit our needs; however, to assess the correctness of the SLRs, we turned to a more appropriate analysis: precision and recall.

5.2 Evaluating the Searches through Precision and Recall Analysis

To compute the precision and recall of each search strategy, we first needed a baseline of the relevant papers that can answer the research question. To build it, the three authors went through all 2435 papers returned by the seven quasi-SLRs and evaluated them according to their own perspective on the research question, taking two issues into consideration: guaranteeing a common empirical study focus (experimental and empirical studies would be accepted) and a common use case quality focus (use case quality can be observed while constructing use case diagrams and descriptions and while inspecting either of them).

For the selection strategy, we followed these two steps (a sketch of the decision logic appears after the list):

  1. Each author read the title and abstract of the papers, evaluating them according to his/her understanding of the research question and the inclusion/exclusion criteria described in the initial protocol. Each paper was rated as I (Include) or E (Exclude);

    a. A consensus would define the final status of the paper;

    b. Whenever two authors decided for the exclusion of a paper, the paper would be excluded;

    c. Whenever two authors decided for the inclusion of a paper and the remaining author for its exclusion, the paper would be marked for a double-check analysis.

  2. Papers marked for double-check analysis were evaluated once more, this time after reading the paper (not necessarily a full reading, though);

    a. The majority of the decisions would define the final status of the paper.
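The sketch below captures this two-step decision logic; the vote labels (“I”/“E”) follow the text, while the function names and example vote lists are illustrative assumptions of ours.

    # Sketch of the two-step selection logic described above.
    def first_pass(votes):
        """votes: three 'I'/'E' ratings given after reading the title and abstract."""
        includes = votes.count("I")
        if includes == 3:
            return "include"       # consensus to include
        if includes == 0:
            return "exclude"       # consensus to exclude
        if includes == 1:
            return "exclude"       # two authors decided for exclusion
        return "double-check"      # two authors for inclusion, one for exclusion

    def second_pass(votes):
        """votes: three 'I'/'E' ratings given after reading the paper itself."""
        return "include" if votes.count("I") >= 2 else "exclude"   # majority decides

    print(first_pass(["I", "E", "I"]))    # double-check
    print(second_pass(["I", "E", "I"]))   # include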

From the 2435 papers, we (the authors) agreed to include 32 papers (29 up to 2010 and three from 2011 to 2012 – Appendix 10) and to exclude 2318 papers, which means an agreement of 96.5%. The remaining 85 papers did not reach a consensus regarding inclusion or exclusion, so they were excluded. This result allowed us to compute the precision and recall of each search (Diest et al. 2009), using the papers we selected as the universe of relevant papers for answering the research question (quasi-gold standard (Zhang et al. 2011)). Table 16 presents the precision and recall of each search, separating the papers up to 2010 from those up to 2012.

Table 16 Precision and recall of each search in %
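As an illustration of how the values in Table 16 are obtained, the sketch below computes precision (relevant papers returned over all papers returned) and recall (relevant papers returned over the size of the quasi-gold standard). Only the 29-paper gold standard up to 2010 comes from the text; the other counts in the example are hypothetical.

    # Illustrative computation of the values in Table 16; counts other than 29 are hypothetical.
    def precision_recall(relevant_returned, total_returned, gold_standard_size):
        precision = 100.0 * relevant_returned / total_returned
        recall = 100.0 * relevant_returned / gold_standard_size
        return precision, recall

    # e.g. a search that returned 300 papers, 15 of which belong to the 29-paper gold standard
    p, r = precision_recall(relevant_returned=15, total_returned=300, gold_standard_size=29)
    print(f"precision = {p:.1f}%, recall = {r:.1f}%")   # precision = 5.0%, recall = 51.7%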

Comparing the results from the previous subsection with the precision and recall of the searches, we noticed that the best protocols and reports also presented the highest precision and recall. However, low protocol and report scores did not directly lead to low precision and recall, and vice versa. For instance, we expected Black to present very low precision and recall and Green to present very low protocol and report scores, which did not hold in comparison to the other teams. The coincidences between the two types of assessment seem to be more related to the effort the teams put into the assignment. Pink and Yellow present the highest concentration of ESE group members, whose supervisor happens to be the professor of the discipline. Also, the Red and Green teams, which presented low scores for protocol and report and for precision and recall, respectively, are composed only of master’s students, even though both include practitioners.

6 Challenges and Pitfalls on SLR Planning and Execution in the SE Field

The observed results differed from our expectations, as similarities in the research questions and protocols did not lead to similar outcomes (see Fig. 5). We are aware that the vast differences in the search strings induced the high number of different returns, but even when analyzing the same set of papers included by the teams, the quality attributes for use cases were not the same. Still, this study allowed us to identify some pitfalls of planning and reporting SLR studies (underlined in the previous sections) that probably caused the differences identified throughout this exploratory study. As we discuss in this section, six main reasons can be highlighted as challenges for conducting SLRs in SE and might explain the mentioned pitfalls; they are the lack of:

  (i) experience in the investigated topic, which caused the novice researchers to misuse terms in the search string, include studies, and give answers not related to the research question;

  (ii) experience of the novice researchers in systematic reviews, which promoted inconsistencies in their review protocol and execution, made them do unnecessary work, and led them not to report relevant information;

  (iii) a common terminology regarding use cases, requirements, and quality attributes, which made the novice researchers search for studies using a variety of different terms and report the results inconsistently;

  (iv) clearness and completeness of the papers, which might have caused their inconsistent inclusion/exclusion among the reviews;

  (v) verification procedures to support the identification of inconsistencies throughout the quasi-SLR process; and

  (vi) commitment to or interest in the research topic, which caused the novice researchers to overlook important features of SLRs and to report neither significant decisions made during the process nor details on the results.

Fig. 5 Expected results versus observed results

The next subsections present observations supporting these perceived main challenges. They might not be representative of all SLR experiences in SE, but they might help us understand some common issues identified in our field, mainly when novice researchers (especially concerning the research method) perform secondary studies. Along with the challenges, we also present some proposals that might be used to overcome them.

6.1 Lack of Experience in the Topic

“Keywords too high-level,” “inappropriate selection of search terms,” “report of information not related to the research topic,” and “controversial understanding of the research topic” may be related to a lack of experience in the topic or even a lack of knowledge about the topic terms used by academia. As previously mentioned, “use cases” – more specifically, “quality attributes for use cases” – was chosen as the subject for investigation because we believed it to be a foundational topic in the SE field about which the novice researchers would have few misunderstandings. Also, to make the teams more similar to practitioners applying the evidence-based software engineering approach, we included in each group at least one participant with substantial experience in the software industry. However, we observed some misunderstandings possibly related to the novices’ experience in the quasi-SLR topic.

Regarding use cases, it was possible to notice that Green extracted from the selected papers several quality attributes not related to use cases, as described in Section 4. A thorough analysis of the Green members’ profiles reveals that one of them has a weak background regarding use cases. However, the other two members have enough experience to avoid these mistakes. As we do not know how the SLR tasks were distributed among the members, we can only speculate about other causes, such as lack of verification procedures or lack of team commitment, as discussed later.

The differences concerning the applied inclusion/exclusion criteria (most of them not described in the protocols), aside from representing issues in describing important decisions, stress the different perspectives regarding the SLR topic among the teams, evidencing the impact that knowledge and experience in the topic might have on the results. Hence, these findings indicate that expertise in the SLR topic seems to play an important role, especially concerning the definition of keywords, the selection of works, and the extraction of results. Therefore, it is important to examine some seminal works on the research topic to become familiar with the vocabulary used in the area and minimize the effects of these issues. A proper selection of control articles also contributes to aligning expectations concerning what to look for and what to select as included papers.

6.2 Lack of Experience in the Method

“Keywords too high-level,” “unnecessary search terms,” “many different combinations of terms producing noise return,” “overuse of ‘and’ operators and distributive properties,” “misuse of the guidelines,” “no explanation concerning the empirical focus used,” “tacit knowledge regarding the study selection strategy,” “misinterpretation of the research question,” “imprecise reports,” “incomplete reports” and “subjectivity of the research synthesis strategy” may be related to lack of experience in the research method. The vast differences identified among the returned papers from the seven quasi-SLRs, even when some agreement regarding the search string was found, revealed the importance of organizing the search string properly.

It was observed that although the teams shared some concepts and terms, the terms were placed in different parts of the logical structure of the search string, leading to lower similarity in the returned papers than in the terms themselves. All teams were instructed to use the PICO structure (Population AND Intervention AND Comparison AND Outcome; no Comparison expected), but we observed that this abstraction was not properly applied in the quasi-SLRs. Some teams did not follow the PICO structure and instead played with the logical structure of the search string, creating an additional dimension and even chaining logical operators, as Pink did. Other teams placed population terms along with the intervention (Purple and Green), and others placed population terms along with the outcome (Black, Red, and Purple). We expected the novice researchers to use, as population, articles describing software development projects at the requirements stage and empirical/experimental studies related to requirements; as intervention, use case descriptions or diagrams, or formats/guidelines/standards for their description; and as outcome, quality attributes for use cases. No comparison was expected since we did not want to compare the intervention with a specific way of describing or modeling use cases.
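To make the expected PICO abstraction concrete, the sketch below assembles a search string with one OR-block per PICO dimension (and no Comparison block). The terms and the helper function are illustrative assumptions of ours, not any team’s actual string.

    # Sketch of a PICO-structured search string (Population AND Intervention AND Outcome).
    population = ["requirements engineering", "software development project",
                  "empirical study", "experimental study"]
    intervention = ["use case description", "use case diagram", "use case guideline"]
    outcome = ["quality attribute", "quality characteristic", "quality factor"]

    def or_block(terms):
        return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

    search_string = " AND ".join(or_block(block) for block in (population, intervention, outcome))
    print(search_string)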

Other issues observed in the analysis of the research protocols were the inclusion of papers satisfying exclusion criteria during the quasi-SLR execution and evidence of missing information that might have affected the novices’ decisions. Purple evolved the exclusion criteria into overly specific ones, for instance: “articles about product line reporting quality attributes for use cases” and “articles about techniques that lead analysts to elaborate use cases such as prototyping, UI/GUI technologies, task models, sketching and mock-ups.” Also, in many cases, we could neither identify criteria for including or excluding some specific papers nor find any explanation about the analysis and synthesis procedure used by the teams, facts that could have biased the study selection. These behaviors made us wonder whether the teams had followed the research protocol and whether they had seen the importance of planning as much as possible before reviewing to avoid biasing the findings. As mentioned in previous works, novices might take advantage of more iterative approaches to SLR execution (Oates and Capper 2009; Lavallée et al. 2014), since such approaches can improve the novices’ understanding of the method and of the information that should be described in order to consistently continue the SLR tasks, avoiding bias in the selection and reports.

Additionally, knowing the selected search engines’ properties beforehand can support their better use, optimizing the search strategy by cutting unnecessary search terms and thus reducing the noise in the returns.

6.3 Lack of Clearness and Completeness of the Papers

“Misunderstandings concerning empirical/experimental study strategies” may be related to a lack of clearness and completeness of the papers under evaluation. The case in which four out of seven teams decided for the inclusion of the same returned paper, and the others for its exclusion (section 5.2), raised the question of whether the empirical study was well explained and categorized in the article. The paper does not report many details about the described case study (Ramos et al. 2009), and this fact might have hampered the teams’ inclusion/exclusion judgment.

The teams’ knowledge and experience in empirical studies might also have contributed to the different results, but we noticed that the Red, Green, and Yellow teams excluded many papers only after reading them in full (according to their BibTeX), that is, they could not decide on inclusion/exclusion just by reading the titles and abstracts of the papers.

This is one more indication that we must keep advocating the use of structured abstracts and the reporting of the guidelines followed in primary studies when writing scientific papers, as well as writing for the synthesis of evidence, as advised by Wohlin (2014).

6.4 Lack of a Common SE Terminology

“Keywords too high-level,” “many different combinations of terms producing noise return,” “inappropriate selection of search terms,” “misunderstandings concerning empirical/experimental study strategies,” and “controversial understanding of the research topic” may be related to the lack of a common SE terminology.

The Black team tried to define general terms such as “scenario,” “guidelines,” “quality attributes,” and “requirements engineering,” creating the smallest term list among all teams, with only 11 terms. On the other hand, the Purple team tried to specify precise terms and completed the task with the biggest term list among all the teams: 215 terms. Analyzing these 215 terms, we observed that Purple used synonyms and tried to cover all possible combinations, such as “use case engineering,” “use case modeling,” “user scenario engineering,” “user scenario modeling,” “user story engineering,” and “user story modeling,” and the same strategy was applied to the other terms. It was also observed that the other teams adopted the same strategy as Purple, but without trying to exhaust all combinations. In summary, except for Black, all the other teams, to a greater or lesser degree, used synonyms at some point during the keywords definition. However, it is not hard to observe that there is a significant difference among the synonyms adopted by each team, which can explain why each SLR returned such distinct sets of papers.
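A plausible reconstruction of how exhaustively crossing synonym sets, as Purple did, quickly inflates a term list is sketched below. The synonym sets come from the examples above; the combination strategy itself is our assumption.

    # Illustrative reconstruction of exhaustive synonym crossing.
    from itertools import product

    subjects = ["use case", "user scenario", "user story"]
    activities = ["engineering", "modeling", "description", "specification"]

    terms = [f"{s} {a}" for s, a in product(subjects, activities)]
    print(len(terms))    # 12 terms from only two small sets; crossing more sets yields hundreds
    print(terms[:3])     # ['use case engineering', 'use case modeling', 'use case description']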

The choice of different synonyms can be a side effect of a lack of knowledge/experience in use cases or other topics related to the SLR, but we do not believe this is the case, since the chosen synonyms, with some exceptions, are related to use cases, quality attributes, and empirical studies. In this particular case, a more adequate explanation is possibly the lack of a common SE terminology. As there is no minimum agreement about the appropriate terminology for investigation topics, reviewers tend to adopt generic terms or run the risk of choosing a set of terms that is not common ground among SE researchers.

Thus, it is important to search for taxonomies in the research topic and use them to support the search for studies and the reporting of results. The latter is crucial to facilitate future aggregations of two or more studies. Additionally, control papers can be useful in this instance as well.

6.5 Lack of Verification Procedures

“Unnecessary search terms,” “inappropriate selection of search terms,” “misuse of the guidelines,” “misunderstandings concerning empirical/experimental study strategies,” “misinterpretation of the research question,” “controversial understanding of the research topic,” “imprecise reports,” “incomplete reports,” and “report of information not related to the research topic” may be related to a lack of verification procedures. Most of the issues identified in the assessment of the research protocols and reports could have been overcome with the use of verification procedures, such as simple pair reviews.

Many planning and outcome inconsistencies can be observed in the research protocols and reports, such as misalignment between the research questions and search strings (e.g. “fault free” and “testwarehouse”); PICO descriptions not in conformance with the search strings; lack of fields in the extraction forms (Appendix 9) to support the extraction of information to answer the research questions (Appendix 1) or even to assess the quality of the selected papers (Appendix 8); and answers not aligned with the research question, among others. These issues highlight the importance of referring to the study research question in each research protocol step while justifying the decisions made.

The study selection process was another issue identified in the quasi-SLRs. Some included articles are inconsistent with the inclusion and exclusion criteria reported by the teams. For instance, the novice researchers in the Green team should have excluded at least two of the included articles if they had followed their exclusion criterion: “papers not presenting features about use case diagrams or specifications.”

Regarding the data extraction, we could not find, in any of the four research protocols (teams Red, Purple, Blue, and Green), fields to hold information supporting the answering of the quality assessment questions, although all protocols had data extraction fields concerned with quality attributes for use cases and their evaluation studies. This observation suggests that the teams either did not assess the studies – the case of Purple and Green – or had to read the included papers again just to evaluate them. The Black team did not plan a study quality assessment in its research protocol; and Pink, Blue, and Yellow, although they planned and reported the evaluation of all included papers, did not use it for any purpose. Only the Red team planned, reported, and used the quality assessment to rank the quality attributes for use cases that had been found.

It is important to keep track of the main research question throughout the research protocol elaboration and follow it during the review execution. A general cross-checking, including of the results, is advisable to verify the consistency of the plan and outcomes and to increase confidence in the results. Zhang and Babar (2010) and Zhang, Babar, and Tell (Zhang et al. 2011) provide an interesting approach, called the quasi-gold standard, for gathering relevant studies during the search step; it is also an attractive alternative for assessing the quality of the search strategy used. Likewise, Petersen and Ali (2011) suggest interesting strategies for study selection that can help in the planning phase of SLRs, mitigating some inconsistency risks during the study selection phase.

6.6 Lack of Commitment to the SLR

“No explanation concerning the empirical focus used,” “tacit knowledge regarding the study selection strategy,” “imprecise reports,” and “incomplete reports” may be related to a lack of team commitment. Since the reviews were accomplished in the context of an ESE course, some participants might have treated this task as something purely related to obtaining a grade.

However, another variable to be considered is the time available to accomplish the SLR. Two months might not be enough to internalize the concepts related to the method and apply them in practice. In this case, the unexpected outcomes may derive from the time pressure to deliver the final report, which in turn led the teams to give up the needed rigor in critical stages of the quasi-SLRs. We noticed that, to reduce the effort and optimize the time spent on the review, the participants split the work among themselves, especially the step of reading the returned papers’ titles and abstracts. This can be observed in the teams’ BibTeX files, in which, in many cases, a single paper was evaluated by only one participant of the team. We also noticed that the limited time might be the cause of the poorly detailed reports (previous section). The amount of detail regarding the identified quality attributes was quite low, and in many cases no definition for them was reported (Appendix 11).

SLRs demand high team commitment: they are time-consuming, their planning requires focus and dedication, and the selection and extraction phases involve careful reading of hundreds or thousands of papers and detailed cross-checking. Without this commitment, team members may try to take shortcuts and reduce the rigor needed to obtain relevant results. Perhaps the best way to secure this commitment is to include in the team only people who have a direct interest in the SLR outcomes. On the other hand, involving people only to increase labor availability may negatively influence the overall results.

7 Threats to Validity

During this exploratory study, several threats to validity could be identified. Concerning construct validity, the authors might not have considered all the main features for observing similarities among research protocols and answers, as well as the impact of the former on the latter, although we did consider most of the elements described in an SLR research protocol to accomplish the comparison. We understand that when an SLR research protocol is not well planned, it might lack important information for guiding the SLR execution. Likewise, as we conjectured in the previous subsections, information might be missing from the protocols – such as selection criteria and analysis procedures – because the students either did not update them frequently or did not make their reasoning explicit in the SLR plan and report. Still, for simplification, we decided to take all the reviews as complete, including the papers included and excluded and the reported quality attributes. We understand that this decision might have made us overlook some answers and particularities that would have prevented us from comparing the research protocols and reported attributes across the reviews. We could have taken advantage of information gathered from the novice researchers to understand their decisions during the review, but we assumed the review was systematic and repeatable. Regarding the coding process used to measure the similarity of search concepts across the reviews – presented in Section 3.6.1 – the three researchers involved in it have significant theoretical knowledge and practical experience in the subject (use cases and quality attributes). Also, the coding was carried out in a meeting where the researchers discussed their different points of view to reach a consensus.

Regarding internal validity, we identified some pitfalls and challenges in the previous sections that might have represented risks to the observed results. In this exploratory study, we did try to anticipate some problems that occurred in similar studies undertaken with students (presented in Section 2). Therefore, the existing initial protocol was used as a starting point for all teams, and they were advised to use SLR guidelines to support the research protocol evolution. Also, the participants were organized into teams according to their knowledge and experience in software development and experimentation. Still, all these efforts were not enough to prevent some other uncontrolled factors, as previously presented. Even though all participants received lectures and extra explanations concerning the SLR execution, the lectures and even the material the students received (including the existing guidelines) might have left room for misunderstandings concerning the assignment.

The use of a non-native language can also be seen as a threat to this study’s validity. It is possible to conjecture that elaborating the search string in English and reading papers in a non-native language caused some difficulty that led to some of the previously discussed mistakes. We do not believe in this possibility because all the participants are used to reading and writing technical papers and assignments in English, which considerably minimizes this kind of confounding factor. The short time for the SLR execution (two months) is also a threat. Nevertheless, most of the planning was given to the participants in the initial protocol, meaning they did not have to plan everything from scratch.

Regarding external validity, we cannot generalize this study, mainly because we observed quasi-SLRs planned and carried out by novice researchers (especially concerning the method) during a limited time. However, we understand that literature reviews have been used as starting points in much SE research executed by researchers (most of them novices, such as graduate students); thus, investigating this research strategy as used by novice researchers is worthwhile. Also, considering that in industrial settings practitioners might not be used to reading and synthesizing papers, many of the findings can also be seen as reasonable in these contexts, even though expertise in SE topics might help practitioners avoid some of the mentioned pitfalls. One thing we could conclude from this experience with novices: lack of knowledge and expertise in the topic and/or method can lead either to divergent results (most probably) or to convergent results by chance.

As for conclusion validity, the authors performed a coding process on the search terms to identify the main concepts present in each search string, which adds some subjectivity to the comparison of the SLRs and might hamper this study’s conclusions and replication. Our intention with this process was to find more generic terms that would support us in identifying similarities among the protocols, as a way to capture not only the exact terms used for the search but also the teams’ intention of searching for works on similar topics. Thus, the main concepts intend to offer generic representations of the search terms. Still, even in this case, it is not possible to see many similarities. One might think that the same process could be used for the outcomes (quality attributes for use cases) as well, and while this is true, we believe it would not add much to the similarity discussions in this particular work, since the generic term (concept) we could abstract from the outcomes would be a single one (quality features), leading to 100% similarity among all teams.

An interesting coding process to apply to the outcomes is one that identifies similar quality attributes with different names and different quality attributes with similar names. We were not able to perform such an analysis, since not all reported quality attributes have an associated definition, as can be observed in Appendix 11, making the coding process unfeasible. Any attempt at capturing the quality attribute definitions from the included papers would make the authors interfere with the teams’ answers to the research question, biasing the comparison among them. Hence, the comparison of the quality attributes reported by each team was made without any interpretation, that is, the comparison was made through the reported terms with almost no inference. For example, “complex” and “complexity” were considered the same quality attribute, while “size” and “small size” were considered different ones. The assignment did not specify anything about the level of granularity at which the quality attributes should be reported. Thus, the terms used to report the quality attributes were either quite different or quite similar, as Table 11 (Section 4.3) allows observing. This last remark is related to another important assumption we made throughout the execution of this study: we assumed that the SLR packages the novice researchers provided were correct and complete. This was a way to properly assess the reliability of SLRs (process repeatability and outcome consistency) without taking the risk of interfering in the teams’ answers.
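The sketch below is a rough illustration of this kind of near-syntactic matching, in which morphological variants collapse into the same attribute while different modifiers keep attributes apart. The suffix-stripping heuristic is only an assumption for illustration and does not reproduce the exact comparison we performed.

    # Rough illustration of near-syntactic matching of quality attribute names.
    def normalize(term):
        term = term.lower().strip()
        for suffix in ("ility", "ity", "ness"):   # crude morphological normalization (illustrative)
            if term.endswith(suffix):
                return term[: -len(suffix)]
        return term

    def same_attribute(a, b):
        return normalize(a) == normalize(b)

    print(same_attribute("complex", "complexity"))   # True  -> counted as the same attribute
    print(same_attribute("size", "small size"))      # False -> counted as different attributes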

8 Conclusions

This work presents and discusses the planning, execution, and results of SLRs performed by novice researchers, evaluating similarities and differences among the protocols and the sets of selected studies. Although SLR protocols with low similarity generated results with low similarity, as expected, protocols with some similarity (in search string, returned and included studies, and so on) led to different results as well. This result makes us conjecture that SLRs are not as reliable (having a repeatable process and leading to consistent results) as we might think, mainly when performed by novices. This discrepancy can be partially explained by the researchers’ inexperience in the SLR method and domain, since several papers discussed in the related works section highlight the researchers’ experience with the method and the domain as a success factor for the repeatability of SLRs and the consistency of their results.

On the other hand, evidence-based software engineering (EBSE) promotes SLRs as the most important instrument for collecting relevant information regarding a particular technology, aiming at supporting practitioners in their decision making in SE. Thus, the successful planning and execution of SLRs performed by professionals is a critical factor for EBSE.

These scenarios bring many pitfalls and challenges related to SLRs conducted by novice researchers and also by practitioners, since there is an inherent difficulty involving this kind of participant, in addition to other factors that cause problems even for the most experienced researchers. We observed that missing information in the research protocols and reports of results, mainly related to the adopted analysis procedure, could compromise the repeatability of SLRs and the consistency of results. Another observation made through this exploratory study was that if the main research objective is not used to guide the SLR planning activities, the researchers might bias the findings in the face of new information they encounter during the study selection. If changes are necessary during the process, then the process must be redone. Iterative approaches to SLR execution (Lavallée et al. 2014) combined with verification procedures can support the use of this research strategy by novices (Oates and Capper 2009). Furthermore, an in-depth understanding of the terminology employed in the topic under investigation, together with a quasi-gold standard, is required to support the elaboration of appropriate search strings (regarding their precision and recall) and the reporting of results.

One issue deserves further discussion: would replacing the novice researchers with more experienced researchers solve the problem? Could the inclusion of a senior researcher with experience in the SLR topic make the SLR process reliable, as mentioned earlier? It is inevitable that there is some degree of subjectivity in the application of the inclusion/exclusion criteria, and especially in the information extraction. This subjectivity is complemented by the researchers’ previous experience, that is, by their perspectives on the topic. This scenario seems similar to that discussed in software artifact inspection studies. Perspective-based reading techniques (Shull et al. 2000) are good examples of the application of multiple perspectives aiming at evaluating a topic from different points of view. In this case, inspectors with various interests and backgrounds increase the likelihood of detecting defects, since different perspectives are explored during the inspection process. In light of these results, the previous question can be rephrased: does the combination of different angles during the evaluation of the studies returned by the search engines play a vital role in the repeatability of SLRs and, hence, their reliability, or will a rigorous protocol with senior researcher support be enough? Future investigations have to look into the boundaries of what a research protocol can provide to support the reliability of an SLR when the researchers become essential in this process, which strategies can be adopted to mitigate this issue, and how these parts can be combined in an efficient manner.