1 Introduction

This paper reports an exercise undertaken by staff and students in the Empirical Software Engineering (ESE) group at NICTA (National ICT Australia) to evaluate the reporting guidelines for controlled experiments proposed by Jedlitschka and Pfahl (2005). In spite of the existence of a specialist book to help software engineers conduct experiments (Wohlin et al. 2000), software engineering experiments are still subject to criticism. The guidelines were developed in response to general criticisms of current standards of performing and reporting empirical studies (Kitchenham et al. 2002), and more specific criticisms that the lack of reporting standards is causing problems when researchers attempt to aggregate empirical evidence because important information is not reported or is reported in an inconsistent fashion (e.g. Pickard et al. 1998; Wohlin et al. 2003).

In fact, controlled experiments are performed infrequently in software engineering. In a recent survey of 5,453 software engineering articles from 12 leading conferences and journals, Sjøberg et al. (2005) found only 103 articles that could be categorized as experiments. However, there is evidence that current reporting practice is inadequate. Dybå et al. (2006) had to exclude 21 experiments from their analysis of power because the authors did not report enough information for a power analysis. Authors did not report any statistical analysis for 14 experiments and in seven cases the experiments were so badly documented that Dybå et al. “did not manage to track which tests answered which hypothesis or research question”. This result confirms the need for reporting guidelines for software engineering experiments.

Jedlitschka and Pfahl recognized that their guidelines needed to be evaluated, saying:

“Our proposal has not yet been evaluated e.g. through peer review by stakeholders, or by applying it to a significant number of controlled experiments to check its usability. We are aware that this proposal can only be the first step towards a standardized reporting guideline.” (Jedlitschka and Pfahl 2005)

We agree with the need for guidelines to be evaluated. If the guidelines are themselves flawed, they could make the problem of poor quality reporting worse than it is currently.

Our evaluation exercise took place between 5th October and 14th December 2005. It was organized as a series of eight working meetings, each taking between 1 and 2.5 h. In this paper, we report the evaluation method we used and the results of our evaluation. We have already reported our results to Jedlitschka and Pfahl, so the main purpose of this paper is to report our evaluation method, since it might prove useful to other groups wanting to evaluate the next version of the reporting guidelines or future reporting guidelines for other forms of empirical study such as case studies, surveys, or systematic reviews.

In Section 2 we give a brief overview of the proposed guidelines. In Section 3 we discuss the various options available for evaluating experimental guidelines and provide a rationale for our choice of perspective-based reviews. In Section 4 we report our evaluation process. In Section 5 we report our evaluation results. In Section 6 we discuss our results.

An earlier version of this paper was presented at ISESE06 (Kitchenham et al. 2006). In this paper, we have extended the report of our evaluation exercise to include:

  • A more detailed discussion of our evaluation, making it clear that we adopted a method based on perspective-based checklists.

  • Consideration of the advantages of the guidelines. We identify the questions in each perspective that were addressed by the guidelines.

  • The full list of amendments classified according to amendment type.

  • A list of questions that are applicable to all (or most) perspectives. This will enable other users of this evaluation method to separate general questions from perspective specific questions.

2 Proposed Reporting Guidelines

Jedlitschka and Pfahl (2005) propose the reporting structure for experiments shown in Table 1. Table 1 identifies the recommended section and subsection headings for a report of an experiment, together with a brief description of the information required in each section and a cross-reference to the subsection of the guidelines that discusses the information authors should supply.

Table 1 Proposed reporting structure

3 Evaluation Options

At our first working meeting, we discussed various theoretical and empirical evaluation methods and considered the viability of each type. Theoretical evaluation can be based on several different approaches:

  • T1. An assessment of each element in the guidelines from the viewpoint of why the element is included in the guidelines; what it is intended to accomplish in terms of supporting readers to find the information they are looking for; and what evidence there is to support the view that the element is important.

  • T2. A review of the process by which the guidelines were constructed identifying the validity of the source material, the aggregation of the source material, and the evaluation process.

  • T3. Reading the guidelines in order to detect defects and areas for improvement taking the viewpoint of different roles that might want to read a report of a software experiment (i.e. a form of perspective-based reading).

  • T4. Mapping any established experimental methodology guidelines to the reporting guidelines.

Empirical evaluation can be based on a variety of possible approaches, for example:

  • E1. Take a sample of published articles reporting experiments that were written without the support of the guidelines and identify whether the articles omit important information that would have been included if the guidelines had been followed. This is similar to the approach taken by Moher et al. (2001), who compared papers in journals that used the CONSORT guidelines with those that did not. The objection to this approach is that the guidelines being evaluated are the basis for their own evaluation.

  • E2. Take a sample of published articles reporting experiments and re-structure them to conform to the guidelines. Then use the duplicate versions as the experimental material in an experiment aimed at evaluating whether the guidelines make it easier a) to understand the papers and/or b) to extract standard information from the papers.

When deciding which evaluation process to undertake, we considered:

  1. Whether the evaluation approach itself was valid, i.e. likely to lead to a trustworthy assessment of the strengths and weaknesses of the guidelines.

  2. Whether the evaluation approach was feasible given our resources (effort, time and people).

  3. Whether the approach was cost effective given the value of the proposed guidelines. We noted that formal experiments are currently not often used in software engineering research. It is possible that industry case studies and surveys might be more relevant.

  4. Whether the approach provided a good learning opportunity for our research group. This was an important issue because the group included PhD students who were learning about empirical software engineering.

After evaluating each approach, as summarized in Table 2, we concluded that an evaluation based on reading the guidelines in order to detect defects and areas for improvement (T3) would be the most appropriate evaluation method for us to undertake. We felt that empirical evaluation was extremely problematic. Experiments based on re-writing existing papers would be too difficult for a group including novice researchers. It would also be biased if information required by the guidelines was not available in the original papers. Of the theoretical evaluation methods, we felt the perspective-based reading approach would provide the best learning opportunity for the PhD students and junior researchers, giving them an opportunity to consider the needs of different readers and discuss, with more experienced researchers, how to meet those needs. We chose this evaluation approach to suit our own pedagogical purposes. It is not our intention to claim that it is inherently better than the other theoretical approaches nor to suggest that the other evaluation methods should not be used. All the theoretical evaluation methods are valuable and could be used together as part of a comprehensive evaluation program.

Table 2 Assessment of evaluation methods

4 Applying Perspective-based Reading to Evaluating the Experimental Guidelines

In this section we discuss the process we used to evaluate the experimental guidelines. Our evaluation process was organized as a series of eight meetings, each lasting between 1 and 2.5 h, which took place between 5th October and 14th December 2005 with at most one meeting a week. The results of each meeting were documented afterwards to provide feedback to participants. The meeting schedule is shown in Table 17 in the Appendix.

4.1 Evaluation Process

The first issue we considered was how to apply perspective-based reading to the goal of evaluating the guidelines. Conventional perspective-based reading is intended to assist the review of software artifacts from the viewpoint of stakeholders, such as the customer, the designer, or the tester, who will use the artifact (see, for example, Shull et al. 2000). Reviewers taking a particular perspective consider a scenario describing how they will use the artifact and ask questions derived from that scenario. For example, Shull et al. described a tester reviewing a requirements document. The tester is required to generate a test case or set of test cases that allow him/her to ensure that the system implementation satisfies the requirements. The tester then answers a number of questions related to the test case generation task.

For our evaluation, it was clear that there were different perspectives related to reading a report of a software experiment and that different perspectives would require different information from the report. However, it was not clear that we could develop appropriate operational scenarios to match the perspectives, because we were not reviewing a specific experimental report; rather, we were reviewing guidelines intended to assist the writing of such a report. For this reason, we decided to base our review of the guidelines on a checklist of questions related to the information required by each perspective. Thus we ended up applying a hybrid reading method using perspective-based checklists.

We also departed significantly from the standard review process. Instead of having a single review meeting with each reviewer taking a different perspective, we decided to undertake a series of reviews where each review addressed a single perspective. We chose this approach because of the learning opportunities implicit in this process. Assigning individual perspectives to each reviewer would have been more efficient, but it may not have ensured that the same level of scrutiny was given to each perspective.

4.2 Identification of the Relevant Perspectives

Our first step was to identify which perspectives we would incorporate into our evaluation process. We identified the following perspectives of interest:

  • Researcher who reads a paper to discover whether it offers important new information on a topic area that concerns him or her.

  • Practitioner/consultant who provides summary information for use in industry and wants to know whether the results in the paper are likely to be of value to his/her company or clients.

  • Meta-analyst who reads a paper in order to extract quantitative information that can be integrated with results of other equivalent experiments.

  • Replicator who reads a paper with the aim of repeating the experiment.

  • Reviewer who reads a paper on behalf of a journal or conference to ensure that it is suitable for publication.

  • Author who would be expected to use the guidelines directly to report his/her experiment.

We also identified the perspective of the editorial board of a journal (or the program committee of a conference) that might choose to adopt reporting guidelines. The adoption or not of a set of international guidelines could have both good and bad impacts:

  • It might suggest to authors that there is a fast track to publication or acceptance by using the guidelines irrespective of the quality of the paper.

  • It might discourage authors of non-experimental studies from submitting to the journal.

  • It might improve the quality of papers.

  • It might improve the quality of reviews.

However, although we believe the perspective of an editorial board is important, we did not think it was one that we could realistically adopt.

For each perspective, we used brainstorming to assess what an individual with that perspective would require from a paper and converted these issues into a number of questions that summarize the issues of importance to the perspective. The checklists we developed for the Researcher, Practitioner/Consultant, Meta-analyst, Replicator and Reviewer perspectives are shown in Tables 3, 4, 5, 6 and 7 respectively. Since the tables are rather long, the main keywords for the questions are shown in italics to assist readability. For the Researcher and Practitioner/Consultant perspectives, we did not attempt to remove duplicate questions because we thought it important to represent each perspective fully. After applying both of these perspectives, we developed the Meta-analyst, Replicator and Reviewer perspectives. For these perspectives, we concentrated on the main differences between each perspective and the Researcher and Practitioner/Consultant perspectives, because our experience with the first two perspectives had shown that there would be too much redundancy in the questions if we produced a complete checklist for each perspective.

Table 3 Researcher checklist
Table 4 Practitioner/consultant checklist
Table 5 Meta-analyst perspective
Table 6 Replicator perspective
Table 7 Reviewer’s perspective

We also decided not to attempt to construct a checklist for the Author perspective, since it would be too close to the Researcher perspective. Instead, we decided to undertake a separate review of the guidelines in which we considered each element in turn, discussing whether:

  • Including the information would be difficult for authors.

  • The guideline element was necessary.

  • Including the information would improve the paper.

  • Including the information would make the paper more difficult to read or write.

Using a different approach for reviewing from the author perspective gave us the chance to address issues not raised explicitly by the perspective-based questions.

4.3 Validity of Checklist Approach

The validity of the checklist approach depends on the validity of the checklists and that, in turn, depends on the experience of the participants. Table 8 confirms that we included participants with extensive experience in industry or academia (or both). All participants had experience of performing and reporting empirical studies of various types. Furthermore, all of the participants except the research associate had some experience of acting as reviewers, and some of the participants had extensive experience. Only one researcher had experience of acting as a replicator, and only the senior researcher had experience of acting as a meta-analyst, although all the participants had some exposure to the principles of systematic literature reviews (Kitchenham 2004), which are a necessary prerequisite to performing a quantitative meta-analysis. Thus, we have some confidence in the validity of the Researcher, Practitioner and Reviewer checklists, but less confidence in the validity of the Meta-analyst and Replicator checklists. We also have a fair degree of confidence that we appreciated the issues associated with reporting empirical studies.

Table 8 Review perspective, paper selection and experience of reviewers

Another practical problem associated with our approach is that, because the checklist questions were developed without direct reference to the guidelines, it is difficult to cross-reference the checklist questions to specific guideline items. However, we think it is more important to have some degree of independence between the evaluation criteria (i.e. the checklist questions) and the item being evaluated (i.e. the guidelines) than to have simple traceability between one and the other, so this problem is inherent in the basic approach.

4.4 Performing the Reviews

For the first two reviews, in order to assist us to understand each perspective, we agreed to read a paper reporting an experiment from the International Symposium on Empirical Software Engineering (ISESE 04) at the same time as we read the guidelines. (Note. This initial reading activity took place before the group review meeting.) Four of the 26 papers in the ISESE 04 conference proceedings reported experiments (Abdelnabi et al. 2004; Abrahao et al. 2004; Schroeder et al. 2004; Verelst 2004) and each member of the group chose one of the papers to help with the review process. The choice of paper was not mandated and most people chose to read Verelst’s paper, while no one opted for Abdelnabi et al.’s paper (see Table 8). This preliminary reading was intended simply to set the scene for reviewing the reporting guidelines. For this reason, we thought it was preferable to read an article that interested us rather than mandate the same article for everyone. We note that the relatively small number of experiments reported in a conference specializing in empirical methods confirms that experiments are currently not a major part of empirical software engineering.

While reading their chosen paper, each person in the group took one of the perspectives (self-chosen, while ensuring both perspectives were covered). The allocation to paper and perspective is shown in Table 8. Everyone who took the practitioner viewpoint had worked for some time in industry (see Table 8); however, some participants with extensive industry experience were studying for PhDs. In addition, one of the review team only took part in the later review meetings. He was a PhD student with 8 years' industrial experience and two years' research experience. Participation in the workshops was not mandatory, and some NICTA staff attended only one or two meetings. These staff contributed to the discussion of the meetings they attended but are not included in Table 8 and did not coauthor this paper. The senior researcher attended all the meetings and kept a record of the discussions. Minutes were circulated after each meeting.

Although each person reviewed his/her chosen ISESE paper from a particular perspective, in the review meetings (first the Researcher perspective and then the Practitioner perspective) they were encouraged to contribute to the discussion of the other perspective. We had originally planned for each person to provide a written list of issues/defects from their allocated perspective. This was done for the first two reviews but not for the last three. In practice, we worked through each of the questions, discussed any issues arising and agreed whether the question raised any problems or identified defects in the guidelines. After the first two reviews, we did not attempt to allocate individuals to specific perspectives.

The final review taking the author perspective proceeded differently. Again we used the ISESE papers to assist our understanding of the author perspective by re-reading our chosen article before taking part in the group review meeting. However, instead of using perspective-based questions at the meeting, we discussed each section of the guidelines sequentially.

5 Results

We found that the guidelines addressed many of the questions in each perspective (see Table 9). Overall they addressed 11 of the 17 Researcher perspective questions (65%), 12 of the 22 Practitioner perspective questions (55%), 10 of the 14 Meta-analyst questions (71%), 7 of the 9 Replicator perspective questions (78%) and all 7 Reviewer perspective questions (100%). The percentages for the Researcher and Practitioner perspectives are not directly comparable to those for the Meta-analyst, Replicator and Reviewer perspectives, because the general questions specified in the Researcher and Practitioner checklists were omitted from the latter three perspectives. However, these results imply that specialist viewpoints are quite well addressed by the guidelines but more general perspectives are less well addressed. In particular, the Practitioner perspective is not very well addressed.
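
Each coverage percentage quoted above is simply the proportion of a perspective's checklist questions that the guidelines address, rounded to the nearest integer percentage; for the Researcher perspective, for example:

$$\text{coverage} = \frac{\text{questions addressed}}{\text{total questions}} = \frac{11}{17} \approx 65\%$$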

Table 9 Questions addressed and not addressed by the guidelines

Although the guidelines addressed many questions, in some cases they were not specific enough about what needed to be reported and in other cases they were too prescriptive. Overall, the perspective-based reviews using the Researcher, Practitioner, Meta-analyst, Replicator and Reviewer perspectives found 44 unique issues that we believed warranted amendment or clarification of the guidelines (see Tables 10, 11, 12, 13 and 14 respectively). The Researcher perspective identified 13 possible amendments, the Practitioner/Consultant perspective 21, the Meta-analyst perspective six, the Replicator perspective three and the Reviewer perspective one. Of these amendments, most (i.e. 32) requested more detailed clarification of the information required in a guideline section; four requested that the guidelines be less prescriptive; three requested more background information; and two identified possible additional sections. The remaining three proposed amendments suggested (a) standardizing the contents of each section in the guideline document, (b) moving information from one section to another and (c) avoiding possible repetition.

Table 10 Proposed amendments arising from researcher perspective questions
Table 11 Proposed amendments arising from practitioner perspective questions
Table 12 Proposed amendments arising from meta-analyst perspective questions
Table 13 Proposed amendments arising from replicator perspective questions
Table 14 Proposed amendments arising from reviewer perspective questions

We also identified eight items that we classified as defects (see Table 15). The most significant defects are D2, D3, D4 and D8. D2 arises because the guidelines are inconsistent with reporting standards used by other experimental disciplines. It is a very significant step to disassociate our discipline from the standards used by all other scientific disciplines, and we need to be sure that this step is necessary. At the very least, we need to articulate the reasons for this divergence, so that software engineering researchers and practitioners understand why it is necessary. D3 is an important issue because it is an area that, if not addressed, may result in guidelines that make the reporting of experiments worse than it currently is. D4 is a general problem but a significant one: if we cannot write so that practitioners can understand and use our results, empirical software engineering is not very useful. D8 concerns the general principles of guidelines and standards: it should be clear what is mandated and what is optional.

Table 15 Defects identified by perspective-based reviews

Whether D1 is a defect or a design decision depends on whether the guidelines aim to address every section or only the most important sections of a research paper. If the guidelines are aiming for completeness, we suggest that the need for appropriate keywords be mentioned, since well-chosen keywords will help readers find the paper. Defects D5, D6 and D7 could easily have been classified as possible amendments. D5 and D6 are both related to the reporting of the technology or technologies being evaluated; if such technologies are not properly described, it is difficult for practitioners to use them. D7 was a specific example of an issue that arose for several of the suggested report section headings, where the guidelines were too specific and should have used more general terms. Another example is the use of the term “subjects” rather than “experimental units”. This raises another general issue: the guidelines may be too people/team centric. They do not address well the large number of technical tool “experiments” that are performed in the software engineering discipline (of which Schroeder et al. 2004 is an example). Are these considered different types of studies? If so, it would be useful to clarify this in the scope of the guidelines; if not, the guidelines should be amended to make them more relevant to technology-intensive experiments.

The final review, based on the Author’s perspective, reiterated many issues noted previously. In particular, we were concerned about suggestions to impose reporting structures that were incompatible with those used in other disciplines, such as the template structure for reporting research objectives and the section headings (see Harris 2002 and Moher et al. 2001 for more conventional section headings). The problem of possible duplication was also reiterated. The main issues not raised previously were that:

  • The relationship between the “Experimental Design” and the “Execution” section needed to be clarified. If the first section was really the “Experimental Plan” and was fully reported, then the “Execution” section should be restricted to reporting deviations from the plan.

  • The ordering of sections was not always appropriate, for example sometimes it is necessary to introduce the measurement concepts before specifying the hypotheses.

6 Discussion and Conclusions

The guidelines addressed many of the questions raised by each perspective, but we found many instances where the guidelines might benefit from amendment and eight instances where we thought the guidelines were defective.

Issues arising from the Author’s perspective identified problems with potential duplication of information. Guidelines need to be very clear about what information goes into which section. This is a problem for the “Experimental Design” and “Execution” sections as well as the numerous validity sections.

Our results suggest that the main problems with the current version of the guidelines are:

  1. Relationships among the individual elements are not clear in the case of reporting validity issues and the reporting of planned tasks versus actual conduct. Thus, it is difficult to be sure what information to put in which section. There is also a risk that the guidelines will result in unnecessary duplication that would make experimental reports less readable.

  2. In places, the guidelines require us to adopt reporting standards that are inconsistent with those of other disciplines. For example, the suggested headings are inconsistent with the IMRAD standard (see Harris 2002 and Moher et al. 2001). We need to be absolutely certain that this is a good idea.

Our results suggest that the guidelines need to be revised. Any revised guidelines will need to be subjected to further theoretical and empirical validation if they are to be generally accepted. We also need to review research results in other disciplines that might provide additional justification for the guidelines structure and contents. For example, as noted by Jedlitschka and Pfahl (2005), Hartley (2004) provides a summary of the numerous studies that have assessed the value of structured abstracts.

A limitation of our evaluation methodology (review using perspective-based checklists) is that we started our evaluation with perspectives that included general questions and ended it with perspectives that included mainly perspective-specific questions. Furthermore we did not check whether some questions were in essence the same but were asked in different ways. We believe that it is preferable to have a separate list of general questions and another list of specific questions for each perspective. Table 16 identifies a set of 17 general questions cross-referenced to the perspectives from which they were obtained; the questions that are the same or similar in other perspectives; and the perspectives to which they apply. Analyzing Table 16 with respect to questions addressed by the guidelines identified in Table 9 shows that the guidelines provide very good coverage of general questions, with 15 of the 17 general questions (88%) addressed by the guidelines.

Table 16 General questions

Our choice of evaluation method seemed to work well for an initial theoretical validation. Our approach of multiple reviews fitted well with the training element of our evaluation exercise, but it is not an essential element of a review-based evaluation; it would be much quicker to perform a single review with individuals each taking a different perspective. We suggest that a similar review-based evaluation be performed on the revised guidelines. This type of evaluation would be appropriate for any research group that includes staff with research and industrial experience, and it would be useful for any group intending to adopt the guidelines to undertake such an evaluation. With respect to the other evaluation options listed in Table 2, we believe that most of them are useful and viable for specific stakeholders:

  • Evaluation of each guideline element (i.e. T1 in Table 2) and determining the mapping between the new guidelines and existing experimental guidelines (i.e. T4) should be performed by the guideline developers.

  • Evaluation of the guideline development process (i.e. T2) is the responsibility of the research community, so could be undertaken by research networks such as the International Software Engineering Research Network (ISERN, http://isern.iese.de/network/ISERN/pub/).

  • An empirical evaluation method based on comparing the completeness of papers prepared using the guidelines with papers prepared without them (i.e. E1) cannot be undertaken until the guidelines are more widely adopted.

  • An empirical evaluation method based on re-writing existing papers to conform to the guidelines and comparing them with the original versions (i.e. E2) requires a substantial research effort and would be best addressed by a research network. However, experimental validation involving re-writing existing experimental reports poses a number of practical problems. A significant problem is that it is difficult to assess how well written any experimental report is, so it may be difficult to assess the before and after versions of a report objectively. In addition, re-writing an existing report will depend on the expertise of the researchers doing the re-writing and the quality of the original report, not just the quality of the guidelines.

An important issue raised by the evaluation exercise is that of the Practitioner/Consultant viewpoint. The guidelines did not fit this perspective well. Attempts to address this perspective would make papers much longer and probably more complex. Would it be better to have different standards for practitioner-oriented papers? On the one hand, it can be argued that experiments in software engineering are not relevant to practitioners because they usually involve students, and/or simplified tasks and materials, and/or unrealistic settings. This would suggest practitioners only want to read case studies or industrial surveys. On the other hand, even if controlled experiments are not representative of industry practice, they provide proof of concept information without which industry is unlikely to undertake any realistic case studies. One course of action may be to re-write research results for practitioner-oriented magazines (as long as copyright issues are addressed). However, it may also be beneficial to identify the issues that are most important to practitioners and ensure they are covered by the current guidelines.

This paper has evaluated guidelines for controlled experiments. However, we believe that software engineering needs reporting guidelines for other types of empirical studies, in particular, case studies performed in industrial settings and industry surveys, not least because these types of study are of most relevance to practitioners. We believe that many of the perspective-based questions related to Researchers, Practitioners, and Reviewers are quite general (with the exception of questions that relate specifically to the methodology used for formal experiments) and can be used to help evaluate reporting guidelines developed for other forms of empirical study. Even the Meta-analyst perspective and the Replicator perspective are relevant to other forms of study although the questions would need to be revised. In particular, any attempt to construct and evaluate guidelines for industrial case studies and surveys should ensure that the Practitioner perspective is fully considered.