1 Introduction

An increasingly popular practice is for software development companies to let developers choose their own technological environment, which means that different developers may use different productivity tools (programming language, IDE, etc.). However, software engineering (SE) is a human-intensive discipline where wrong decisions can compromise the quality of the resulting software.

In SE, decisions on which methods, techniques and tools to use in software development are typically based on developers' perceptions and/or opinions rather than evidence, as suggested by Dybå et al. (2005) and Zelkowitz et al. (2003). However, empirical evidence might not be available, as certain methods, techniques or tools may not have been studied within a particular setting or even at all. Alternatively, developers may simply not be acquainted with such studies, according to Vegas and Basili (2005). On this ground, it is important to discover how well developers' perceptions (beliefs) match reality and, if they do not, find out what is behind this mismatch, as noted by Devanbu et al. (2016).

According to psychology, experience plays a role in people's perceptions. This has also been observed in SE by Devanbu et al. (2016). However, this research sets out to discover how well perceptions match reality in the absence of previous experience in the technology being used. This makes sense for several reasons: 1) experience is not the only factor affecting developers' perceptions; 2) development teams are usually composed of a mix of people with and without experience; and 3) it is not clear what type of experience influences perceptions. For example, Dieste et al. (2017) conclude that academic rather than professional experience could be affecting the external quality of the code generated by developers when applying Test-Driven Development.

We aim to study whether perceptions about the effectiveness of three defect detection techniques match reality, and if not, what is behind these perceptions. To the best of our knowledge, this is the first paper to empirically assess this issue.

To this end, we conducted an empirical study plus a replication with students. During the original study we measured (as part of a controlled experiment) the effectiveness of two testing techniques and one code review technique when applied by the participants. We then checked the perceived most effective technique (gathered by means of a survey) against the real one. Additionally, we analysed the cost of the mismatch between perceptions and reality in terms of loss of effectiveness. Major findings include:

  • Different people perceive different techniques to be more effective. No one technique is perceived as being more effective than the others.

  • The perceptions of roughly half of the participants (11 out of 23) are wrong.

  • Wrong perceptions of techniques can reduce effectiveness by 31 percentage points (pp) on average.

These findings led us to extend the goal of the study in a replication to investigate what could be behind participants’ perceptions. To do this, we examined their opinions on the techniques they applied and the programs they tested in a replication of the controlled experiment. Major findings include:

  • The results of the replication confirm the findings of the original study.

  • Participants think that technique effectiveness depends exclusively on their performance and not on possible weaknesses of the technique itself.

  • The opinions about technique complexity and preferences for techniques do not seem to play a role in perceived effectiveness.

These results are useful for developers and researchers. They suggest:

  • Developers should become aware of the limitations of their judgement.

  • Tools should be designed that provide feedback to developers on how effective techniques are.

  • The combination of techniques that is both easily applicable and effective should be determined.

  • Instruments should be developed to make empirical results available to developers.

The material associated with the studies presented here can be found at https://github.com/GRISE-UPM/Misperceptions.

The article is organised as follows. Section 2 describes the original study. Section 3 presents its validity threats. Section 4 discusses the results. Section 5 describes the replicated study based on the modifications made to the original study. Section 6 presents its validity threats. Section 7 reports the results of this replicated study. Section 8 discusses our findings and their implications. Section 9 shows related work. Finally, Section 10 outlines the conclusions of this work.

2 Original Study: Research Questions and Methodology

2.1 Research Questions

The main goal of the original study is to assess whether participants’ perceptions of their testing effectiveness using different techniques are good predictors of real testing effectiveness. This goal has been translated into the following research question:

RQ1: Should participants’ perceptions be used as predictors of testing effectiveness?

This question was further decomposed into:

  • RQ1.1: What are participants’ perceptions of their testing effectiveness?

    We want to know if participants perceive a certain technique as more effective than the others.

  • RQ1.2: Do participants’ perceptions predict their testing effectiveness?

    We want to assess if the technique each participant perceives as most effective is the most effective for him/her.

  • RQ1.3: Do participants find a similar number of defects with all techniques?

    Choosing the most effective technique can be difficult if participants find a similar number of defects with two or all three techniques.

  • RQ1.4: What is the cost of any mismatch?

    We want to know whether the cost of not correctly perceiving the most effective technique is negligible and whether it depends on the technique perceived as most effective.

  • RQ1.5: What is the expected project loss?

    Taking into consideration that some participants will correctly perceive their most effective technique (mismatch cost 0), and others will not (mismatch cost greater than 0), we calculate the overall cost of (mis)match for all participants in the empirical study and check if it depends on the technique perceived as most effective.

2.2 Study Context and Ethics

We conducted a controlled experiment where each participant applies three defect detection techniques (two testing techniques and one code review technique) on three different programs. For testing techniques, participants report the generated test cases, later run a set of test cases that we have generated (instead of the ones they created), and report the failures found.Footnote 1 For code reading they report the identified faults. At the end of the controlled experiment, each participant completes a questionnaire containing a question related to his/her perceptions of the effectiveness of the techniques applied. The course is graded based on their technique application performance (this guarantees a thorough application of the techniques).

The study is embedded in an elective 6-credit Software Verification and Validation course. The regular assessment (when the experiment does not take place) is as follows: students are asked to write a specification for a program that can be coded in about 8 hours. Specifications are later interchanged so that each student codes a program different from the one (s)he proposed. Later, students are asked to individually perform (in successive weeks) code reading and white-box testing on the code they wrote. At this point, each student delivers the code to the person who wrote the specification, so that each student performs black-box testing on the program (s)he proposed. Note that this scenario requires more effort from the student (as (s)he is asked to first write a specification and then code a program, and these tasks do not take place when the study is run). In other words, the students' workload during the experiment is smaller than the workload of the regular course assessment. The only activity that takes place during the experiment that is not part of the regular course is answering the questionnaire, which can be done in less than 15 minutes. Although the study causes changes in the workflow of the course, its learning goals are not altered.

All tasks required by the study, with the exception of completing the questionnaire, take place during the slots assigned to the course. Therefore, there is no additional effort for the students beyond attending lectures (which is mandatory in any case).

Note that students are allowed to withdraw from the controlled experiment, but this would affect their course score. However, the same applies when the experiment is not run: if a student misses one assignment, (s)he scores 0 in that assignment and his/her course score is affected accordingly. In contrast, students are allowed to withdraw from the study without any penalty to their score, as submission of the questionnaire is completely voluntary. No incentives are given to those students who submit the questionnaire.

Submitting the questionnaire implies giving consent to participate in the study. Students are aware that this is a voluntary activity carried out for research purposes, but they can also get feedback. Students who do not submit the questionnaire are not considered in the study in any way, as they have not given consent to use their data. For this reason, they are not included in the quantitative analysis of the controlled experiment (even though their data are available for scoring purposes).

The study is performed in Spanish, as it is the participants’ mother tongue. Its main characteristics are summarised in Table 1.

Table 1 Description of the experiment

2.3 Constructs Operationalization

Code evaluation technique is an experiment factor, with three treatments (or levels): equivalence partitioning (EP)—see Myers et al. (2004), branch testing (BT)—see Beizer (1990), and code reading by stepwise abstraction (CR)—see Linger (1979).

The response variables are technique effectiveness, perception of effectiveness and mismatch cost. Technique effectiveness is measured as follows:

  • For EP and BT, it is the percentage of faults exercised by the set of test cases generated by each participant. In order to measure the response variable, experimenters execute the test cases generated by each participant.Footnote 2

  • For CR, we calculate the percentage of faults correctly reported by each participant (false positives are discarded).

Note that dynamic and code review techniques are not directly comparable as they are different technique types (dynamic techniques find failures and code review techniques find faults). However, the comparison is fair, as:

  • Application time is not taken into account, and participants are given enough time to complete the assigned task.

  • All faults injected are detectable by all techniques. Further details about faults, failures and their correspondence are given in Section 2.5.

Perception of effectiveness is gathered by means of a questionnaire with one question that reads: Using which technique did you detect most defects?Footnote 3

Mismatch cost is measured, for each participant, as the difference between the effectiveness of the technique (s)he perceives as most effective and the effectiveness of the technique that is actually most effective for him/her. Note that participants know neither the total number of seeded faults nor which techniques are best for their colleagues or themselves. This operationalization imitates the reality of testers, who lack such knowledge in real projects. Therefore, the perception is fully subjective (and made in relation to the other two techniques).

Table 2 presents three examples of how mismatch cost is measured. Cells with a grey background indicate the technique for which the highest effectiveness is observed for the given participant.

Table 2 Measuring mismatch cost

The first row shows a situation where the participant perceives CR as the most effective technique, but EP is actually the most effective for him/her. In this situation, there is a mismatch (misperception), and the associated cost is calculated as the difference in effectiveness between CR and EP. The second row shows a situation where the participant correctly perceives EP as the most effective technique for him/her. In this situation there is a match (correct perception) and, therefore, the associated mismatch cost is 0pp. The third row shows a situation where the participant perceives BT as the most effective technique for him/her, and BT and EP are tied as his/her most effective techniques. In this situation we consider that there is a match (correct perception) and, therefore, the associated mismatch cost is 0pp.
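As a concrete illustration of this operationalization, the following minimal sketch (hypothetical Python code and effectiveness values, not the study's materials) computes the mismatch cost from a participant's per-technique effectiveness and the technique (s)he perceived as most effective; a tie with the real maximum counts as a correct perception, as in the third example.

```python
# Hypothetical sketch of the mismatch-cost operationalization described above.
# Effectiveness values are percentages of seeded faults detected per technique.

def mismatch_cost(effectiveness, perceived):
    """Return the mismatch cost in percentage points (pp).

    effectiveness: dict such as {"EP": 85.7, "BT": 57.1, "CR": 42.9}
    perceived: technique the participant perceives as most effective.
    If the perceived technique ties with the real maximum, the cost is 0.
    """
    best = max(effectiveness.values())
    return round(best - effectiveness[perceived], 1)

# Examples mirroring the three rows of Table 2 (illustrative numbers only):
print(mismatch_cost({"EP": 85.7, "BT": 57.1, "CR": 42.9}, "CR"))  # mismatch: 42.8 pp
print(mismatch_cost({"EP": 85.7, "BT": 57.1, "CR": 42.9}, "EP"))  # match: 0.0 pp
print(mismatch_cost({"EP": 71.4, "BT": 71.4, "CR": 42.9}, "BT"))  # tie counts as match: 0.0 pp
```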

2.4 Study Design

Testing techniques are applied by human beings, and no two people are the same. Due to the dissimilarities between participants that exist prior to the experiment (degree of competence achieved in previous courses, innate testing abilities, etc.), there may be variability between different participants applying the same treatment. Therefore, we opted for a crossover design, as described by Kuehl (2000) (a within-subjects design where each participant applies all three techniques, but different participants apply the techniques in a different order) to prevent dissimilarities between participants and technique application order from having an impact on results. The design of the experiment is shown in Table 3, and a sketch of this kind of assignment is given below.

Table 3 Experimental design
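For illustration, the kind of crossover assignment described above can be sketched as follows. This is a minimal sketch only: the actual group-to-sequence mapping and the program order per session are those defined in Tables 3 and 4 (not reproduced here), and the participant identifiers are hypothetical.

```python
# Minimal sketch of a crossover assignment: each participant applies all three
# techniques, and the six possible application orders are spread evenly and
# randomly over participants. Participant identifiers are hypothetical.
import random
from itertools import permutations

TECHNIQUES = ["EP", "BT", "CR"]

def assign_orders(participants, seed=1):
    orders = list(permutations(TECHNIQUES))      # 6 possible technique sequences
    rng = random.Random(seed)
    shuffled = participants[:]
    rng.shuffle(shuffled)
    # Round-robin over the orders keeps group sizes as balanced as possible.
    return {p: orders[i % len(orders)] for i, p in enumerate(shuffled)}

assignment = assign_orders([f"S{i:02d}" for i in range(1, 33)])
for participant in sorted(assignment)[:3]:
    print(participant, assignment[participant])
```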

The experimental procedure takes place over seven weeks and is summarised in Table 4. During the first three weeks there are training sessions in which participants learn how to apply the techniques and practice with them. Training sessions take place twice a week (Tuesdays and Thursdays) and each one lasts 2 hours. Therefore, training takes 12 hours (2 hours/session x 2 sessions/week x 3 weeks). Participants are first taught the code review technique, then white-box and finally black-box testing. This training order was not chosen for experimental reasons; it is simply the one we have found best meets the learning objectives of the course.

Table 4 Experimental procedure

The following week there are no lectures, and students are asked to practice with the techniques. For this purpose, they are given 3 small programs in C (that contain faults) and are asked to apply a given technique on each program (all students apply the same technique on the same training program). Performance on these exercises is used for grading purposes.

The remaining three weeks are experiment execution weeks. Each experiment execution session takes place once a week (Fridays) and lasts four hours. This is equivalent to there being no time limit, as participants can complete the task in less time. Therefore, experiment execution takes 12 hours (4 hours/session x 1 session/week x 3 weeks). Training sessions take place during lecture hours and experiment execution sessions take place during laboratory hours. In weeks with lectures there is no laboratory, and vice versa. The time used for the controlled experiment is the time assigned to the course in which the study is embedded. No extra time is used.

In each session, participants apply the techniques and, for equivalence partitioning and branch testing, also run test cases. They report how they applied the technique, together with the generated test cases and failures (for the testing techniques) or the faults found (for the code review technique).

At the end of the last experiment execution session (after applying the last technique), participants are surveyed about their perceptions of the techniques that they applied. They must return their answer before the following Monday, to guarantee that they remember as much as possible about the tasks performed.

2.5 Experimental Objects

Program is a blocking variable. It is not a factor, because the goal of the experiment is not to study the programs but the code evaluation techniques. However, it is a blocking variable, because we are aware that programs could be influencing results. The experiment has been designed to cancel out the influence of programs: every participant applies each technique on a different program, and each technique is applied on different programs (by different participants). Additionally, the program by technique interaction is analysed later.

The experiment uses three similar programs written in C (used in other empirical studies on testing techniques, such as those performed by Kamsties and Lott (1995) or Roper et al. (1997)):

  • cmdline: parser that reads the input line and outputs a summary of its contents. It has 239 executable LOC and a cyclomatic complexity of 37.

  • nametbl: implementation of the data structure and operations of a symbol table. It has 230 executable LOC and a cyclomatic complexity of 27.

  • ntree: implementation of the data structure and operations of an n-ary tree. It has 215 executable LOC and a cyclomatic complexity of 31.

Appendix A shows a complete listing of the metrics gathered by the PRESTFootnote 4 tool (Kocaguneli et al. 2009) on the correct programs (before faults were injected). Although the purpose of the programs is different, we can see that most of the metrics obtained by PREST are quite similar, except Halstead metrics, which are greater for ntree. At the same time, cmdline is slightly larger and more complex than the other two.

Each program has been seeded with seven faults (some, but not all, are the same faults as used in previous experiments run on these programs), and there are 2 versions of each faulty program. All faults are conceptually the same in all programs (e.g., a variable initialisation is missing). Some faults occurred naturally when the programs were coded, whereas others are typical programming faults. All faults:

  • Cause observable failures.

  • Can be detected by all techniques.

  • Are chosen so that the programs fail only on some inputs.

  • Do not conceal one another.Footnote 5

  • Have a one-to-one correspondence with failures.

Note, however, that it is possible that a participant generates two (or more) test cases that exercise the same seeded fault, and therefore produce the same failure. Participants have been advised to report these failures (the same failure exercised by two or more different test cases) as a single one. For example, there is a fault in program ntree in the function in charge of printing the tree. This causes the failure that the tree is printed incorrectly. Every time a participant generates a test case that prints the tree (which is quite often, as this function is useful to check the contents of the tree at any time), the failure will be shown.

Some examples of the seeded faults and their corresponding failures are:

  • Variable not initialised. The associated failure is that the number of input files is printed incorrectly in cmdline.

  • Incorrect boolean expression in a decision. The associated failure is that the program does not output error if the second node of the “are siblings” function does not belong to the tree.

2.6 Participants

The 32 participants of the original study were fifth (final) year undergraduate computer science students taking the elective Software Verification and Validation course at the Universidad Politécnica de Madrid. The students have taken two Software Engineering courses of 6 and 12 credits, respectively. They are trained in SE, have strong programming skills, have experience programming in C, have participated in small-size development projects,Footnote 6 and have little or no professional experience. Therefore, they should not be considered inexperienced in programming, but rather good proxies for junior programmers.

They have no formal training in any code evaluation technique (including the ones involved in the study), as this is the course in which they are taught them. Since they have had previous coding assignments, they might have done some testing before, albeit informally. As a consequence, they might have acquired some intuitive knowledge on how to test/review programs (developing their own techniques or procedures that could resemble the techniques), but they have never learned the techniques formally. They have never been required to do peer reviews in coding assignments, or to write test cases in the projects in which they have participated. They could possibly have used assertions or informal input validation, but on their own initiative (never on request, and without having previously been taught how to do it).

All participants have a homogeneous background. The only differences could be due to the level of achievement of learning goals in previous courses, or to innate ability for testing. The former could have been determined by means of scores in previous courses (which was not possible). The latter was not possible to measure. Therefore, we did not deem it necessary to apply any kind of blocking, and we just performed simple randomisation.

Therefore, the sample used represents developers with little or no previous experience with code evaluation techniques (novice testers). The use of our students is appropriate in this study on several grounds:

  • We want to rule out any possible influence of previous experience on code evaluation techniques. Therefore, participants should not have any preconceived ideas or opinions about the techniques (including having a favourite one).

  • Falessi et al. (2017) suggest that it is easier to induce a particular behaviour among students; more specifically, to reinforce a high level of adherence to the treatment by the experimental subjects applying the techniques.

  • Students are used to making predictions during development tasks, as they are continually undergoing assessment in courses related to programming, SE, networking, etc.

Having said that, since our participants are not practitioners, their opinions are not based on any previous work experience in testing, but on their experience of informally testing programs for some years (they are in the fifth year of a five-year CS bachelor's degree). Additionally, as part of the V&V training, our participants are asked to practice the techniques used in the experiment on small programs.

According to Falessi et al. (2017), we (SE experimenters) tend to forget practitioners' heterogeneity. Practitioners have different academic backgrounds, SE knowledge and professional experience. For example, a developer without a computer science academic background might not have knowledge about testing techniques. We assume that, for this exploratory study, the participants are a valid sample of developers who have little or no experience with code evaluation techniques and are junior programmers.

2.7 Data Analysis

The analyses conducted in response to the research questions are explained below.Footnote 7 Table 5 summarises the statistical tests used to answer each research question. First we report the analyses (descriptive statistics and hypothesis testing) of the controlled experiment.

Table 5 Statistical tests used to answer research questions

To examine participants’ perceptions (RQ1.1), we report the frequency of each technique (percentage of participants that perceive each technique as the most effective). Additionally, we determine whether all three techniques are equally frequently perceived as being the most effective. We test the null hypothesis that the frequency distribution of the perceptions is consistent with a discrete uniform distribution, i.e., all outcomes are equally likely to occur. To do this, we use a chi-square (χ2) goodness-of-fit test.
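As an illustration, such a goodness-of-fit test can be sketched as follows (a minimal sketch using scipy; the observed counts are placeholders, not the study data reported in Table 10):

```python
# Sketch of the chi-square goodness-of-fit test against a uniform distribution.
# The observed counts (participants naming EP, BT and CR as most effective)
# are placeholders, not the study data.
from scipy.stats import chisquare

observed = [10, 7, 6]                      # EP, BT, CR (hypothetical counts)

# With no expected frequencies supplied, chisquare tests against a uniform
# distribution, i.e. each technique equally likely to be perceived as best.
stat, p_value = chisquare(observed)
print(f"chi2({len(observed) - 1}, N={sum(observed)}) = {stat:.3f}, p = {p_value:.3f}")
```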

To examine if participants' perceptions predict their testing effectiveness (RQ1.2), we use Cohen's kappa coefficient along with its 95% confidence interval, calculated using bootstrap. Cohen's kappa coefficient (κ) is a statistic that measures agreement for qualitative (categorical) variables when 2 raters are classifying different objects (units). It is calculated from the corresponding contingency table. Table 6 shows an example of a contingency table. Cells contain the frequencies associated with each pair of classes.

Table 6 Example of contingency table

Kappa is generally thought to be a more robust measure than a simple percent agreement calculation, since it takes into account the agreement occurring by chance. It is not the only coefficient that can be used to measure agreement. There are others, like Krippendorff's alpha, which is more flexible, as it can be used in situations where there are more than 2 raters or where the response variable is measured on an interval or ratio scale. However, in our particular situation, with 2 raters, data on a nominal scale and no missing data, kappa behaves similarly to Krippendorff's alpha (Banerjee et al. 1999; Zapf et al. 2016).

Kappa ranges from -1 to 1. Positive values are interpreted as agreement, while negative values are interpreted as disagreement. There is still some debate about how to interpret kappa. Different authors have categorised ranges of kappa values that differ with respect to the degree of agreement that they suggest (see Table 7). According to the scales by Altman (1991) and Landis and Koch (1977), 0.6 is the value as of which there is considered to be agreement. Fleiss and Levin (2003) lower this value to 0.4. Each branch of science should establish its own kappa threshold. As there are no previous studies that specifically address which agreement scale and threshold are most appropriate for SE, and different studies in SE have used different scales,Footnote 8 we use Fleiss and Levin's more lenient scale as our baseline.

Table 7 Interpretation of kappa values

We measure, for all participants, the agreement between the technique perceived as most effective by each participant and the technique that is actually most effective for that participant. Therefore, we have 2 raters (perceptions and reality), three classes (BT, EP and CR), and as many units to be classified as participants.
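A minimal sketch of this computation (using scikit-learn's cohen_kappa_score and a simple participant-level bootstrap; the vectors below are hypothetical, not the study data) could look as follows:

```python
# Sketch of the agreement analysis: Cohen's kappa between the technique each
# participant perceives as most effective and the one that actually was, with
# a bootstrap 95% CI. The data below are placeholders, not the study data.
import numpy as np
from sklearn.metrics import cohen_kappa_score

perceived = np.array(["EP", "CR", "BT", "EP", "CR", "BT", "EP", "CR"])
actual    = np.array(["EP", "EP", "BT", "CR", "CR", "EP", "EP", "BT"])

kappa = cohen_kappa_score(perceived, actual)

rng = np.random.default_rng(0)
n = len(perceived)
boot = []
for _ in range(2000):                       # resample participants with replacement
    idx = rng.integers(0, n, n)
    boot.append(cohen_kappa_score(perceived[idx], actual[idx]))

# nanpercentile guards against the rare resample where kappa is undefined.
low, high = np.nanpercentile(boot, [2.5, 97.5])
print(f"kappa = {kappa:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```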

Since there could be agreement for some but not all techniques, we also measure kappa for each technique separately (kappa per category), following the approach described in Everitt (2000). It consists of collapsing the corresponding contingency table. Table 8 shows the collapsed contingency table for Class A from Table 6. Note that a collapsed table is always a 2x2 table.

Table 8 Example of collapsed contingency table
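The per-category kappa can then be sketched by recoding both raters as "technique of interest vs. rest", which is equivalent to collapsing the contingency table as in Table 8 (again a hypothetical illustration, not the study's actual analysis script):

```python
# Sketch of kappa per category: collapse the 3x3 problem into "technique X vs rest"
# for each technique and compute kappa on the resulting 2x2 classification.
import numpy as np
from sklearn.metrics import cohen_kappa_score

perceived = np.array(["EP", "CR", "BT", "EP", "CR", "BT", "EP", "CR"])
actual    = np.array(["EP", "EP", "BT", "CR", "CR", "EP", "EP", "BT"])

for technique in ["EP", "BT", "CR"]:
    collapsed_perceived = np.where(perceived == technique, technique, "rest")
    collapsed_actual = np.where(actual == technique, technique, "rest")
    k = cohen_kappa_score(collapsed_perceived, collapsed_actual)
    print(f"kappa for {technique} vs rest: {k:.2f}")
```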

In the event of disagreement, we also study the type of mismatch between perceptions and reality, i.e., whether the disagreement leads to some sort of bias in favour of any of the techniques. To do this, we use the respective contingency table to run the Stuart-Maxwell test of marginal homogeneity (testing the null hypothesis that the distribution of perceptions matches reality) and the McNemar-Bowker test for symmetry (testing the null hypothesis of symmetry), as explained in Everitt (2000). The hypothesis of marginal homogeneity corresponds to equality of row and column marginal probabilities in the corresponding contingency table. The test for symmetry determines whether observations in cells situated symmetrically about the main diagonal have the same probability of occurrence. In a 2x2 table, symmetry and marginal homogeneity are equivalent. In larger tables, symmetry implies marginal homogeneity, but the converse is not true.Footnote 9
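Both tests operate on the square perceptions-versus-reality contingency table. A minimal sketch, assuming the statsmodels SquareTable API and placeholder counts, is:

```python
# Sketch of the marginal-homogeneity (Stuart-Maxwell) and symmetry
# (McNemar-Bowker) tests on a square perceptions-vs-reality table.
# Assumes the statsmodels SquareTable API; the counts are placeholders.
import numpy as np
from statsmodels.stats.contingency_tables import SquareTable

# Rows: technique perceived as most effective; columns: actually most effective.
table = np.array([[4, 2, 1],    # perceived EP
                  [2, 3, 2],    # perceived BT
                  [1, 3, 5]])   # perceived CR

st = SquareTable(table, shift_zeros=False)
homogeneity = st.homogeneity(method="stuart_maxwell")
symmetry = st.symmetry(method="bowker")
print("Stuart-Maxwell:", homogeneity.statistic, homogeneity.pvalue)
print("McNemar-Bowker:", symmetry.statistic, symmetry.pvalue)
```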

Since we have injected only 7 defects in each program, there is the possibility that, if no agreement is found between perceptions and reality, it is because participants find a similar number of defects with all three (or pairs of) techniques (RQ1.3). If this were the case, then it would be difficult for them to choose the most effective technique. To check this, we measure agreement between the effectiveness values obtained by participants with the different techniques. Therefore we have 3 raters (techniques) and as many units as participants. This is done with all participants, and with participants in the same experiment group, for every group; for all techniques, and for pairs of techniques. Note that kappa can no longer be used, as we are seeking agreement on interval data. For this reason, we use Krippendorff's alpha (Hayes and Krippendorff 2007) along with its 95% confidence interval, calculated using bootstrap and the KALPHA macro for SPSS.Footnote 10
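Outside SPSS, the same computation can be sketched with the third-party krippendorff Python package (our choice for illustration only; the study itself used the KALPHA macro). Rows act as the "raters" (techniques) and columns as the units (participants); all values below are placeholders.

```python
# Sketch of Krippendorff's alpha on interval data, with rows acting as "raters"
# (techniques) and columns as units (participants). Uses the third-party
# `krippendorff` package (pip install krippendorff); values are placeholders.
import numpy as np
import krippendorff

effectiveness = np.array([
    [85.7, 57.1, 71.4, 42.9, 100.0],   # EP, % defects found per participant
    [57.1, 42.9, 85.7, 57.1,  71.4],   # BT
    [28.6, 42.9, 57.1, 14.3,  42.9],   # CR
])

alpha = krippendorff.alpha(reliability_data=effectiveness,
                           level_of_measurement="interval")
print(f"Krippendorff's alpha (interval) = {alpha:.3f}")

# A bootstrap 95% CI can be obtained by resampling participants (columns).
rng = np.random.default_rng(0)
n_units = effectiveness.shape[1]
boot = [krippendorff.alpha(reliability_data=effectiveness[:, rng.integers(0, n_units, n_units)],
                           level_of_measurement="interval")
        for _ in range(2000)]
print("95% CI:", np.percentile(boot, [2.5, 97.5]))
```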

To examine the mismatch cost (RQ1.4) and project loss (RQ1.5), we report the cost of the mismatch (when it is greater than zero for RQ1.4, and in all cases for RQ1.5) associated with each technique, as explained in Section 2.3. To discover whether there is a relationship between the technique perceived as being the most effective and the mismatch cost and project loss, we apply a one-way ANOVA test or a Kruskal-Wallis test of medians for normal and non-normal distributions, respectively, along with visual analyses (scatter plots).
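For example, the non-parametric branch of this analysis (the Kruskal-Wallis comparison of mismatch cost across the three perceived-best techniques) can be sketched with scipy as follows; the cost values are placeholders, not those of Table 14.

```python
# Sketch of the Kruskal-Wallis test comparing mismatch cost (in pp) across
# groups defined by the technique perceived as most effective. For RQ1.4 only
# non-zero costs are included; for RQ1.5 all costs (including 0) would be used.
# The values below are placeholders, not the study data.
from scipy.stats import kruskal

cost_perceived_ep = [14, 29, 43]
cost_perceived_bt = [57, 29, 71, 14]
cost_perceived_cr = [29, 43, 14, 57]

h_stat, p_value = kruskal(cost_perceived_ep, cost_perceived_bt, cost_perceived_cr)
print(f"H(2) = {h_stat:.3f}, p = {p_value:.3f}")
```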

3 Original Study: Validity Threats

Based on the checklist provided by Wohlin et al. (2014), the relevant threats to our study are described next.

3.1 Conclusion Validity

  1. Random heterogeneity of participants. The use of a within-subjects experimental design ruled out the risk of the variation due to individual differences among participants being larger than the variation due to the treatment.

3.2 Internal Validity

  1. History and maturation:

    • Since participants apply different techniques on different artefacts, learning effects should not be much of a concern.

    • Experimental sessions take place on different days. Given that grades are tied to performance in the experiment, we expect students to try to do better in each successive session, which could cause the technique applied on the last day to achieve better effectiveness. To avoid this, different participants apply the techniques in different orders. This cancels out the threat due to order of application (preventing any given technique from benefiting from the maturation effect). In any case, an analysis per day is performed to study the maturation effect.

  2. Interactions with selection. Different behaviours in different technique application groups are ruled out by randomly assigning participants to groups. However, we check this by analysing the behaviour of the groups.

  3. Hypothesis guessing. Before filling in the questionnaire, participants were only partially informed about the goal of the study. We told them that we wanted to know their preferences and opinions, but they were not aware of our research questions. In any case, if this threat were present, it would mean that our results for perceptions are the best possible ones, and would therefore set an upper bound.

  4. Mortality. The fact that several participants did not give consent to participate in the study has affected the balance of the experiment.

  5. Order of training. Techniques are taught in the following order: CR, BT and EP. If this threat had materialised, CR (the first technique taught) would be the most effective (or the favourite) one.

3.3 Construct Validity

  1. Inadequate preoperational explanation of cause constructs. Cause constructs are clearly defined thanks to the extensive training received by participants on the study techniques.

  2. Inadequate preoperational explanation of effect constructs. The question asked is straightforward and should not be subject to misinterpretation. However, since perception is subjective, it is possible that different participants interpret the question differently and, hence, that perceptions are related to how the question is interpreted. This issue should be further investigated in future studies.

3.4 External Validity

  1. Interaction of setting and treatment. We tried to make the faults seeded in the programs as representative as possible of reality.

  2. Generalisation to other subject types. As we have already mentioned, the type of subjects our sample represents are developers with little or no previous experience in testing techniques and junior programmers. The extent to which the results obtained in this study can be generalised to other subject types needs to be investigated.

Of all threats listed, the only one that could affect the validity of the results of this study in an industrial context is the one related to generalisation to other subject types.

4 Original Study: Results

Of the 32 students participating in the experiment, nine did not complete the questionnaireFootnote 11 and were removed from the analysis. Table 9 shows the balance of the experiment before and after participants submitted the questionnaire. We can see that G6 is the most affected group, with 4 missing people.

Table 9 Balance before and after submitting the questionnaire in the original study

Appendix B shows the analysis of the experiment. The results show that program and technique are statistically significant (and therefore are influencing effectiveness), while group and the technique by program interaction are not significant.

As regards the techniques, EP shows a higher effectiveness, followed by BT and then by CR. These results are interesting, as all techniques are able to detect all defects. Additionally, more defects are found in ntree than in cmdline and nametbl, where the same number of defects is found. Note that ntree is the program used on the first day, has the highest Halstead metrics, and is neither the smallest program nor the one with the lowest complexity.

These results suggest that:

  • There is no maturation effect. The program for which the highest effectiveness is obtained is the one used on the first day.

  • There is no interaction with selections effect. Group is not significant.

  • Mortality does not affect experimental results. The analysis technique used (Linear Mixed-Effects Models) is robust to lack of balance.

  • Order of training could be affecting results. The highest effectiveness is obtained with the last technique taught, while the lowest effectiveness is obtained with the first technique taught. This suggests that techniques taught last are more effective than techniques taught first, which could be due to participants remembering the most recently taught techniques better.

  • Results cannot be generalised to other subject types.

4.1 RQ1.1: Participants’ Perceptions

Table 10 shows the percentage of participants that perceive each technique to be the most effective. We cannot reject the null hypothesis that the frequency distribution of the responses to the questionnaire item (Using which technique did you detect most defects?) follows a uniform distributionFootnote 12 (χ2(2,N = 23)= 2.696, p = 0.260). This means that the number of participants perceiving a particular technique as more effective cannot be considered different across the three techniques. Our data do not support the conclusion that some techniques are more frequently perceived as being the most effective than others.

Table 10 Participants’ perceptions of technique effectiveness in the original study

4.2 RQ1.2: Comparing Perceptions with Reality

Table 11 shows the value of kappa along with its 95% confidence interval (CI), overall and for each technique separately. We find that all values for kappa with respect to the questionnaire item (Using which technique did you detect most defects?) are consistent with lack of agreement (κ < 0.4, poor). Although the upper bounds of the 95% CIs show agreement, 0 belongs to all 95% CIs, meaning that agreement by chance cannot be ruled out. Therefore, our data do not support the conclusion that participants correctly perceive the most effective technique for them.

Table 11 Agreement between perceived and real technique effectiveness in the original study (N = 23)

It is worth noting that agreement is higher for the code review technique (the upper bound of the 95% CI in this case shows excellent agreement). This could be attributed to participants being able to remember the actual number of defects identified in code reading whereas for testing techniques they only wrote the test cases. On the other hand, participants do not know the number of defects injected in each program.

As lack of agreement cannot be ruled out, we examine whether the perceptions are biased. The results of the Stuart-Maxwell test show that the null hypothesis of marginal homogeneity cannot be rejected (χ2(2,N = 23)= 1.125, p = 0.570). This means that we cannot conclude that perceptions and reality are differently distributed. Taking into account the results reported in Section 4.1, this would suggest that, in reality, techniques cannot be considered the most effective a different number of times.Footnote 13 Additionally, the results of the McNemar-Bowker test show that the null hypothesis of symmetry cannot be rejected (χ2(3,N = 23)= 1.286, p = 0.733). This means that we cannot conclude that there is directionality when participants' perceptions are wrong. These two results suggest that participants are not more mistaken about one technique than about the others: techniques are not differently subject to misperceptions.

4.3 RQ1.3: Comparing the Effectiveness of Techniques

We check whether misperceptions could be due to participants detecting the same number of defects with all three techniques, which would make it impossible for them to make the right choice. Table 12 shows the value and 95% CI of Krippendorff's α, overall and for each pair of techniques, for all participants and for every design group (participants that applied the same technique on the same program) separately, and Table 13 shows the value and 95% CI of Krippendorff's α, overall and for each program/session. For the values with all participants, we can rule out agreement, as the upper bounds of the 95% CIs are consistent with lack of agreement (α < 0.4), except for EP-BT and nametbl-ntree, for which the upper bounds of the 95% CIs are consistent with fair to good agreement. However, even in these two cases, 0 belongs to the 95% CIs, meaning that agreement by chance cannot be ruled out. This means that participants do not obtain effectiveness values so similar across the different techniques (or the different programs) that it would be difficult for them to discriminate among techniques/programs.

Table 12 Agreement between percentage of defects found with each technique in the original study
Table 13 Agreement between percentage of defects found with each program in the original study (N = 23)

Furthermore, the α values are negative, which indicates disagreement. This is good for the study, as it means that participants should be able to discriminate among techniques, and lack of agreement cannot be attributed to it being impossible to discriminate among techniques.

As regards the results for groups, although the α values are negative,Footnote 14 the 95% CIs are too wide to yield reliable results (due to the small sample size). Note that in most cases they range from disagreement at the lower bound (α < -0.4) to agreement at the upper bound (α > 0.4).

4.4 RQ1.4: Cost of Mismatch

Table 14 and Fig. 1 show the cost of mismatch. We can see that the EP technique has fewer mismatches than the other two. Additionally, its mean and median mismatch costs are smaller. On the other hand, the BT technique has more mismatches and a higher dispersion. The results of the Kruskal-Wallis test reveal that we cannot reject the null hypothesis of techniques having the same mismatch cost (H(2)= 0.685, p = 0.710). This means that we cannot claim a difference in mismatch cost between the techniques. The estimated mean mismatch cost is 31pp (median 26pp).

Fig. 1 Scatterplot for observed mismatch cost in the original study. Datapoints correspond to the mismatch cost in Table 14

Table 14 Observed reduction in technique effectiveness for mismatch

These results suggest that the mismatch cost is not negligible (31pp), and is not related to the technique perceived as most effective. However, note that the existence of very high mismatches and few datapoints could be affecting these results.

4.5 RQ1.5: Expected Loss of Effectiveness

Table 15 shows the average loss of effectiveness that should be expected in a project, where typically different testers participate, and therefore, there would be both matches and mismatches.Footnote 15 Again, the results of the Kruskal-Wallis test reveal that we cannot reject the null hypothesis of techniques having the same expected reduction in technique effectiveness for a project (H(2)= 1.510, p = 0.470). This means we cannot claim a difference in project effectiveness loss between techniques. The mean expected loss in effectiveness in the project is estimated as 15pp.Footnote 16

Table 15 Observed reduction in technique effectiveness when considering matches and mismatches

These results suggest that the expected loss in effectiveness in a project is not negligible (15pp), and is not related to the technique perceived as most effective. However, we must note again that the existence of very high mismatches for BT and few datapoints could be affecting these results.

4.6 Findings of the Original Study

Our findings are:

  • Participants should not base their decisions on their own perceptions, as their perceptions are not reliable and have an associated cost.

  • We have not been able to find a bias towards one or more particular techniques that might explain the misperceptions.

  • Participants should have been able to identify the different effectiveness of techniques.

  • Misperceptions cannot be put down to experience. The possible drivers of these misperceptions require further research.

Note that these findings cannot be generalised to types of developers other than those with the same profile as the participants in this study.

5 Replicated Study: Research Questions and Methodology

We decide to further investigate the results of the original study in search of possible drivers behind misperceptions. Psychology considers that people's perceptions can be affected by personal characteristics such as attitudes, personal interests and expectations. Therefore, we decide to examine participants' opinions by conducting a differentiated replication of the original study (Shull et al. 2008) that extends its goal as follows:

  1. The survey of effectiveness perception is extended to include questions on programs.

  2. We want to find out whether participants' perceptions might be conditioned by their opinions. More precisely: their preferences (favourite technique), their performance (the technique that they think they applied best) and technique or program complexity (the technique that they think is easiest to apply, or the simplest program to be tested).

Therefore, the replicated study reexamines RQ1 stated in the original study (this time the survey taken by participants also includes questions regarding programs), and addresses the following new research questions:

  • RQ1.6: Are participants' perceptions related to the number of defects reported by participants?

    We want to assess if participants perceive as the most effective technique the one with which they reported the most defects.

  • RQ2: Can participants’ opinions be used as predictors for testing effectiveness?

    • RQ2.1: What are participants’ opinions about techniques and programs?

      We want to know if participants have different opinions about techniques or programs.

    • RQ2.2: Do participants’ opinions predict their effectiveness?

      We want to assess if the opinions that participants have about techniques (or programs) predict which one is the most effective for them.

  • RQ3: Is there a relationship between participants’ perceptions and opinions?

    • RQ3.1: Is there a relationship between participants’ perceptions and opinions?

      We want to assess if the opinions that participants have about techniques (or programs) are related to their perceptions.

    • RQ3.2: Is there a relationship between participants’ opinions?

      We want to assess if certain opinions that participants have about the techniques are related to other opinions.

To answer these questions, we replicate the original study with students of the same course in the following academic year. This time we have 46 students. The changes made to the replication of the experiment are as follows:

  • The questionnaire to be completed by participants at the end of the experiment is extended to include new questions. The information we want to capture with the opinion questions is:

    • Participants' performance with the techniques. With this question we are referring to process conformance. The best applied technique is the technique each participant thinks (s)he applied most thoroughly. It corresponds to OT1: Which technique did you apply best?

    • Participants' preferences. We want to know each participant's favourite technique, i.e., the one (s)he felt most comfortable with when applying it. It corresponds to OT2: Which technique do you like best?

    • Technique complexity. We want to know the technique with which each participant thinks it was easiest to achieve process conformance. It corresponds to OT3: Which technique is the easiest to apply?

    • Program testability. We want to know which program was easiest to test, that is, the program for which process conformance could be achieved most easily. It corresponds to OP1: Which is the simplest program?

    Table 16 summarizes the survey questions. We have chosen these questions because we need to ask simple questions that can be easily understood by participants while remaining meaningful. We do not want to overwhelm participants with complex questions that require lengthy explanations. A complex questionnaire might discourage students from submitting it.

  • The program faults are changed. The original study was designed so that all techniques are effective at finding all injected defects; we chose faults detectable by all techniques so that the techniques could be compared fairly. The replicated study is designed to cover the situation in which some faults cannot be detected by all techniques. Therefore, we inject some faults that techniques are not effective at detecting.

    For example, BT cannot detect a non-implemented feature (as participants are required to generate test cases from the source code only). Likewise, EP cannot find a fault whose detection depends on the combination of two invalid equivalence classes. Therefore, in the replicated study, we inject into each program some faults that can be detected by BT but not by EP, and some faults that can be detected by EP but not by BT (each program is seeded with six faults). Note that the design is balanced: we inject the same number of faults that BT can detect but EP cannot as faults that EP can detect but BT cannot. This change is expected to affect the effectiveness of EP and BT, which might be lower than in the original study. It should not affect the effectiveness of CR.

  • We change the program application order to further study maturation issues. The order is now: cmdline, ntree, nametbl. This change should not affect the results.

  • Participants run their own test cases. It could be that the misperceptions observed in the original study were due to the fact that participants did not run their own test cases.

  • There is now a single version of each faulty program instead of two. Faults and failures are not the goal of this study, and this change helps simplify the experiment.

Table 16 Questions of replicated study questionnaire

Table 17 shows a summary of the changes made to the study.

Table 17 Changes made to the original study

To measure technique effectiveness we proceed in the same way as in the original study. We do not rely on the reported failures, as participants could:

  1. Report false positives (non-real failures).

  2. Report the same failure more than once (although they were asked not to do so).

  3. Miss failures corresponding to faults that have been exercised by the technique, but for some reason have not been seen.

We measure the new response variable (reported defects) by counting the number of faults/failures reported by each participant.

We analyse RQ2.1 in the same manner as RQ1.1, and RQ1.6, RQ2.2, RQ3.1 and RQ3.2 like RQ1.2. Table 18 summarises the statistical tests used to answer each research question.

Table 18 Statistical tests used to answer new research questions of the replicated study

6 Replicated Study: Validity Threats

The threats to validity listed in the original study apply to this replicated study. Additionally, we have identified the following ones:

6.1 Conclusion Validity

  1. Reliability of treatment implementation. The replicated experiment is run by the same researchers who performed the original experiment. This ensures that the two groups of participants do not implement the treatments differently.

6.2 Internal Validity

  1. Evaluation apprehension. The use of students, together with the association of their performance in the experiment with their course grade, might explain why participants consider that their own performance, rather than the weaknesses of the techniques, explains a technique's effectiveness.

6.3 Construct Validity

  1. Inadequate preoperational explanation of effect constructs. Since opinions are hard constructs to operationalize, it is possible that the questions in the questionnaire are not interpreted by participants in the way we intended.

6.4 External Validity

  1. Reproducibility of results. It is not clear to what extent the results obtained here are reproducible. Therefore, more replications of the study are needed. The steps that should be followed are:

    (a) Replicate the study capturing the reasons for the answers given by participants.

    (b) Perform the study with practitioners with the same characteristics as the students used in this study (people with little or no experience in software testing).

    (c) Explore and define what types of experience could be influencing the results (academic, professional, programming, testing, etc.).

    (d) Run new studies taking into consideration increasing levels of experience.

Again, of all threats affecting the replicated study, the only one that could affect the validity of the results of this study in an industrial context is the one related to generalisation to other subject types.

7 Replicated Study: Results

Of the 46 students participating in the experiment, seven did not complete the questionnaireFootnote 17 and were removed from the analysis. Table 19 shows the changes in the experimental groups due to students not participating in the study. Balance is not seriously affected by mortality—although it would have been desirable that Group 5 had at least one more participant.

Table 19 Balance before and after dropouts in the replicated study

Additionally, another four participants did not answer all the questions and were removed from the analysis of the respective questions.

7.1 RQ1: Participants’ Perceptions as Predictors

7.1.1 RQ1.1-RQ1.5: Comparison with Original Study Results

Appendix C shows the analysis of the experiment. Program is the only statistically significant variable (group, technique and the program by technique interaction are not significant). In this replication, fewer defects are found in cmdline than in nametbl and ntree, where the same number of defects is found. Some results are in line with those obtained in the original study:

  • There is no interaction with selections effect. Group is not significant.

  • Mortality does not affect experimental results. The analysis technique used (Linear Mixed-Effects Models) is robust to lack of balance.

  • Results cannot be generalized to other subject types.

But others contradict those obtained in the original study, and therefore need further investigation:

  • A maturation effect cannot be ruled out. The program for which the lowest effectiveness is obtained is the one used on the first day.

  • Order of training does not seem to be affecting results. All techniques show the same effectiveness.

Table 20 shows the results for participants' perceptions of the techniques. The results are the same as in the original study (χ2(2,N = 37)= 3.622, p = 0.164). Our data do not support the conclusion that some techniques are more frequently perceived as being the most effective than others.

Table 20 Participants’ perceptions for technique effectiveness in the replicated study

Our data do not support the conclusion that participants correctly perceive the most effective technique for them. The overall and per-technique kappa values and 95% CIs reported in Table 21 are in line with those in the original study. This suggests that the hypothesis we elaborated in the original experiment would not be correct. For some reason, perceptions are more accurate for the CR technique.

Table 21 Agreement between technique effectiveness perceptions and reality in replicated study (PT1, N = 37)

Again, as in the original study, we have not been able to observe bias in perceptions (Stuart-Maxwell: χ2(2,N = 37)= 3.103, p = 0.212; McNemar-Bowker: χ2(3,N = 37)= 3.143, p = 0.370).

Table 22 shows the value of Krippendorff's α and its 95% CI, overall and for each pair of techniques, for all participants and for every design group (participants that applied the same technique on the same program) separately, and Table 23 shows the value of Krippendorff's α and its 95% CI, overall and for each program/session. Again, the results obtained are the same as in the original study. Participants do not obtain effectiveness values so similar across the different techniques (or the different programs) that it would be difficult for them to discriminate among techniques/programs.

Table 22 Agreement between techniques for percentage of defects found in the replicated study
Table 23 Agreement between programs for percentage of defects found in the replicated study (N = 39)

Table 24 and Fig. 2 show the cost of mismatch. As in the original study, the mismatch cost is not related to the technique perceived as being the most effective (Kruskal-Wallis: H(2)= 2.979, p = 0.226). Also, the proportion of mismatches is about the same as in the original study (48% of mismatches in the original study versus 51% in the replicated study).

Fig. 2 Scatterplot for observed mismatch cost in the replicated study

Table 24 Observed reduction in technique effectiveness for mismatch

However, there are some differences with respect to the original study:

  • While CR had the greatest number of mismatches in the original study, now it has the smallest. The number of mismatches for BT and EP has increased with respect to the original study.

  • In the replicated study, the mismatch cost is slightly lower (25pp compared with 31pp in the original study). The mismatch cost is smaller when CR is involved.

This could be due to the change in the seeded faults or simply to natural variation; it should be further checked. However, it is a fact that the effectiveness of EP and BT has decreased in the replicated study, while CR shows a similar effectiveness to that in the original study. This suggests that the mismatch cost could be related to the faults that the program contains. However, this issue needs to be further investigated, as we have few data points. Note that, as in the original study, the existence of few datapoints could be affecting these results.

Table 25 shows the average loss of effectiveness that should be expected in a project due to mismatch. The expected loss in effectiveness in a project is similar to the one observed in the original study (13pp), but this time it is related to the technique perceived as most effective (Kruskal-Wallis: H(2)= 9.691, p = 0.008). This means that some mismatches are more costly than others. The misperception of CR as the most effective technique has a lower associated cost (4pp) than that of BT or EP (18pp). This suggests that participants who think CR is the most effective technique might be allowed to apply it, as, even if they are wrong, the loss of effectiveness would be negligible. However, participants should not rely on their perceptions even in this case, since fault type could have an impact on this result and they will never know beforehand what faults a program contains. Note again that the existence of few datapoints could be affecting these results. Therefore, this issue needs to be further researched.

Table 25 Observed reduction in technique effectiveness when considering matches and mismatches

The findings of the replicated study are:

  • They confirm the results of the original study.

  • A possible relationship between fault type and mismatch cost should be further investigated.

Since the results of both studies are similar, we have pooled the data and performed joint analyses for all research questions to overcome the problem of lack of power due to sample size. They are reported in Appendix D. The results confirm those obtained by each study individually. This allows us to gain confidence in the results obtained.

7.1.2 RQ1.6: Perceptions and Number of Defects Reported

One of the conclusions of the original study was that the perceived most effective technique could match the technique with the highest number of defects reported. Table 26 shows the value of kappa and its 95% CI, overall and for each technique separately. We find that all kappa values for the agreement between the perceived most effective technique and the technique with the greatest number of defects reported are consistent with lack of agreement (κ < 0.4, poor). However, the upper bounds of all 95% CIs show agreement, and the lower bounds of all 95% CIs except BT’s are greater than zero. This means that, although our data do not support the conclusion that participants correctly perceive the most effective technique for them, it should not be ruled out.
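
As a reference for this kind of analysis, the sketch below computes Cohen’s kappa with an approximate 95% CI via a percentile bootstrap on hypothetical labels; the paper does not specify how its CIs were obtained, so the bootstrap is an assumption.

```python
# Minimal sketch: Cohen's kappa with an approximate percentile-bootstrap 95% CI,
# on hypothetical labels (perceived most effective technique vs. technique with
# most defects reported, per participant).
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)
perceived = np.array(["EP", "CR", "BT", "EP", "CR", "EP", "BT", "CR", "EP", "BT"])
reported  = np.array(["EP", "BT", "BT", "CR", "CR", "EP", "EP", "CR", "BT", "BT"])

kappa = cohen_kappa_score(perceived, reported)

n = len(perceived)
boot = []
for _ in range(2000):
    idx = rng.integers(0, n, n)            # resample participants with replacement
    boot.append(cohen_kappa_score(perceived[idx], reported[idx]))
lo, hi = np.nanpercentile(boot, [2.5, 97.5])   # nan-safe for degenerate resamples
print(f"kappa = {kappa:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```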

Table 26 Agreement between perceived most effective technique and technique with greatest number of defects reported in the replicated study (N = 37)

This means that participants’ perceptions about technique effectiveness could be related to reporting a greater number of defects with that technique.

As lack of agreement cannot be ruled out, we examine whether the perceptions are biased. The results of the Stuart-Maxwell test show that the null hypothesis of marginal homogeneity cannot be rejected (χ2(2, N = 37) = 2.458, p = 0.293). This means that we cannot conclude that perceptions and reported defects are differently distributed. Additionally, the results of the McNemar-Bowker test show that the null hypothesis of symmetry cannot be rejected (χ2(3, N = 37) = 2.867, p = 0.413). This means that we cannot conclude that there is directionality when participants’ perceptions do not match the technique with the highest number of defects reported.
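
A minimal sketch of how both tests can be run on a 3×3 contingency table with statsmodels, using illustrative counts rather than the study’s table, follows:

```python
# Minimal sketch of the Stuart-Maxwell (marginal homogeneity) and
# McNemar-Bowker (symmetry) tests on an illustrative 3x3 contingency table
# (rows: perceived most effective technique; columns: technique with most
# defects reported).
import numpy as np
from statsmodels.stats.contingency_tables import SquareTable

table = np.array([
    [5, 2, 3],    # perceived CR
    [1, 4, 2],    # perceived BT
    [6, 3, 10],   # perceived EP
])

st = SquareTable(table)
homogeneity = st.homogeneity()   # Stuart-Maxwell test of marginal homogeneity
symmetry = st.symmetry()         # Bowker's test of symmetry
print(f"Stuart-Maxwell: chi2 = {homogeneity.statistic:.3f}, p = {homogeneity.pvalue:.3f}")
print(f"McNemar-Bowker: chi2 = {symmetry.statistic:.3f}, p = {symmetry.pvalue:.3f}")
```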

The lack of a clear agreement could be due to the fact that participants do not remember exactly the number of defects found with each technique.

7.1.3 RQ1.1-RQ1.2: Program Perceptions

Table 27 shows the results for participants’ perceptions of the program in which they detected most defects. We found that the same phenomenon applies to programs as to techniques: we cannot reject the null hypothesis that the frequency distribution of the responses follows a uniform distribution (χ2(2, N = 37) = 2.649, p = 0.266), so our data do not support the conclusion that any one program is perceived more frequently than the others as being the one where most defects were found. This contrasts with the fact that cmdline has a slightly higher complexity and number of LOC, and that ntree shows the highest Halstead metrics; we expected cmdline and/or ntree to be perceived less frequently as having a higher detection rate.
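
A minimal sketch of this uniformity check, assuming hypothetical response counts (scipy’s chisquare defaults to equal expected frequencies across categories):

```python
# Minimal sketch of the chi-square goodness-of-fit test against a uniform
# distribution, using hypothetical counts of participants naming each program
# as the one where they found most defects.
from scipy.stats import chisquare

counts = [14, 10, 12]        # cmdline, nametbl, ntree (illustrative counts)
stat, p = chisquare(counts)  # expected frequencies default to uniform
print(f"chi2(2) = {stat:.3f}, p = {p:.3f}")
```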

Table 27 Participants’ perceptions of program effectiveness in the replicated study (PP1)

However, the kappa values in Table 28 show that there seems to be agreement overall and for cmdline and ntree (κ > 0.4, fair to good, and agreement by chance can be ruled out, since 0 does not belong to the 95% CI), but not for the nametbl program (κ = 0.292, poor, and agreement by chance cannot be ruled out, as 0 belongs to the 95% CI). This means that participants do tend to correctly perceive the program in which they detected most defects. This is striking, as it contrasts with the disagreement for techniques. Pending the analysis of the mismatch cost, it suggests that participants’ perceptions of the percentage of defects found may be reliable. This is interesting, as cmdline has a higher complexity.

Table 28 Agreement between program effectiveness perceptions and reality in the replicated study (PP1, N = 37)

Since there is agreement, we are not going to study mismatch cost.

Misperceptions do not seem to affect participants’ perception of how well they have tested a program.

7.2 RQ2: Participants’ Opinions as Predictors

7.2.1 RQ2.1: Participants’ Opinions

Table 29 shows the results for participants’ opinions with respect to techniques.

Table 29 Participants’ opinions for techniques in the replicated study

With regard to the technique participants applied best (OT1), we can reject the null hypothesis that they perceive each of the three techniques equally often as the one they applied best (χ2(2, N = 38) = 10.947, p = 0.004). More people think they applied EP best, followed by both BT and CR (which merit the same opinion).

In the case of the technique participants liked best (OT2), the results are similar. We can reject the null hypothesis that participants equally often regard all three techniques as being their favourite technique (χ2(2, N = 38) = 22.474, p = 0.000). Most people like EP best, followed by both BT and CR (which merit the same opinion).

Finally, as regards the technique that participants found easiest to apply (OT3), the results are exactly the same as for the preferred technique (χ2(2, N = 38) = 22.474, p = 0.000). Most people regard EP as being the technique that is easiest to apply, followed by both BT and CR (which merit the same opinion).

Table 30 shows the results for participants’ opinions about the programs. We cannot reject the null hypothesis that all programs are equally frequently viewed as the simplest (χ2(2, N = 38) = 1.474, p = 0.479). Therefore, our data do not support the conclusion that any one program is perceived more frequently than the others as being the simplest. This result suggests that the differences in the complexity and size of cmdline and in the Halstead metrics of ntree are small. Alternatively, participants could be interpreting this question differently, or the question used to operationalize the corresponding construct may be vague and participants may not be interpreting it correctly.

Table 30 Participants’ opinions for programs in the replicated study

7.2.2 RQ2.2: Comparing Opinions with Reality

The technique that participants think they applied best (OT1) is not a good predictor of technique effectiveness. The overall and per-technique kappa values in the fourth column of Table 31 are consistent with lack of agreement (in all cases κ < 0.4, poor; and although the upper bounds of the 95% CIs show agreement, 0 belongs to most 95% CIs, meaning that agreement by chance cannot be ruled out). However, we find that there is a bias, as the Stuart-Maxwell and McNemar-Bowker tests can reject the null hypotheses of marginal homogeneity (χ2(2, N = 38) = 10.815, p = 0.004) and symmetry (χ2(3, N = 38) = 12.067, p = 0.007), respectively. Looking at the light and dark grey cells in the corresponding contingency table (Table 32), we find that the cells below the diagonal have higher values than those above it. In other words, there are rather more participants who consider that they applied EP best despite achieving better effectiveness with CR and BT (9 and 5) than participants who consider that they applied CR or BT best despite being more effective using EP (1 in both cases). This suggests that there is a bias towards EP, which is much more pronounced with respect to CR. These results are consistent with the ones found in the previous section. There are several possible interpretations for these results: 1) we do not know whether the opinion on the best applied technique is accurate (that is, whether it really is the best applied technique); 2) possibly due to the change in faults, technique performance is worse in this replication than in the original study; and 3) participants may have misunderstood the question. Interviewing participants, or asking them in the questionnaire about the reasons for their answers, would have helped to clarify this last issue.

Table 31 Agreement between opinions and reality for techniques in the replicated study
Table 32 Contingency table for best applied technique (OT1) in the replicated study

As regards participants’ favourite technique (OT2), the results are similar. This opinion does not predict technique effectiveness, since all kappa values in the fourth column of Table 31 denote lack of agreement (in all cases κ < 0.4, poor; and although the upper bounds of the 95% CIs show agreement, 0 belongs to all 95% CIs, meaning that agreement by chance cannot be ruled out). Again, we find there is bias, as the Stuart-Maxwell and McNemar-Bowker tests can reject the null hypotheses of marginal homogeneity (χ2(2, N = 38) = 11.931, p = 0.003) and symmetry (χ2(3, N = 38) = 11.974, p = 0.007), respectively. Looking at the light and dark grey cells in Table 33, we again find that there is a bias towards EP: there are rather more participants who like EP best despite being more effective using CR and BT (12 and 5) than participants who like CR or BT best despite being more effective using EP (1 in both cases). The bias between CR and EP is more pronounced. Note that it is very unlikely that participants did not properly interpret this question; it just seems that the technique they like most is not typically the most effective.

Table 33 Contingency table for favourite (OT2) and easiest to apply (OT3) technique in the replicated study

Finally, with respect to the technique that is easiest to apply (OT3), we find that the results are exactly the same as for their preferred technique. However, as we have seen in OT2, their preferred technique is not a good predictor of effectiveness (see third row of Table 31), and there is bias towards EP (see light and dark grey cells in Table 33). These results are in line with a common claim in SE, namely, that developers should not base the decisions that they make on their opinions, as they are biased. Again, it should be noted that participants might not be interpreting the question as we expected. Further research is necessary.

As far as the simplest program is concerned, we find, as we did for the techniques, that it is not a good predictor of the program in which most defects were detected: the overall and per-program kappa values in Table 34 denote lack of agreement (in all cases κ < 0.4, poor; and although the upper bounds of the 95% CIs show agreement, 0 belongs to the 95% CIs, except for ntree, meaning that agreement by chance cannot be ruled out). Unlike the opinions on techniques, we were not able to find any bias this time, as neither the null hypothesis of marginal homogeneity (χ2(2, N = 38) = 1.621, p = 0.445) nor that of symmetry (χ2(3, N = 38) = 3.286, p = 0.350) can be rejected. This result suggests that the programs that participants perceive to be the simplest are not necessarily the ones where most defects have been found. Again, note that participants may be interpreting the notion of the simplest program differently than we expected.

Table 34 Agreement between opinions and reality for programs in the replicated study (OP1, N = 38)

7.2.3 Findings

Our findings suggest:

  • Participants’ opinions should not drive their decisions.

  • Participants prefer EP (think they applied it best, like it best and think that it is easier to apply), and rate CR and BT equally.

  • All three programs are equally frequently perceived as being the simplest.

  • The programs that the participants perceive as the simplest are not the ones where the highest number of defects have been found.

These results should be understood within the validity limits of the study.

7.3 RQ3: Comparing Perceptions and Opinions

7.3.1 RQ3.1: Comparing Perceptions and Opinions

In this section, we look at whether participants’ perceptions of technique effectiveness are biased by their opinions about the techniques.

According to the kappa values shown in the fourth column of Table 35 (PT1-OT1), the results are compatible with agreement (overall and per technique, except for BT, for which lack of agreement cannot be ruled out) between the technique perceived to be the most effective and the technique participants think they applied best (in all cases κ > 0.4, fair to good, and in all cases but BT, 0 does not belong to the 95% CIs, meaning that agreement by chance can be ruled out). This is an interesting finding, as it suggests that participants think that technique effectiveness is related to how well the technique is applied. Technique performance certainly decreases if techniques are not applied properly. It is no less true, however, that techniques have intrinsic characteristics that may lead to some defects not being detected; in fact, the controlled experiment includes some faults that some techniques are unable to detect. A possible explanation for this result could be that the evaluation apprehension threat is materializing.

Table 35 Agreement between technique perceptions and opinions in the replicated study

On the other hand, the kappa values in the fourth column of Table 35 (PT1-OT2) reveal a lack of agreement for CR and BT between the preferred technique and the technique perceived as being most effective (in both cases κ < 0.4, poor; and although the upper bounds of the 95% CIs show agreement, 0 belongs to the 95% CIs, meaning that agreement by chance cannot be ruled out), whereas overall lack of agreement cannot be ruled out (κ < 0.4, poor; the upper bound of the 95% CI shows agreement, and 0 does not belong to the 95% CI, meaning that agreement by chance can be ruled out). Finally, there is agreement (κ > 0.4, fair to good) in the case of EP. This means that, in the case of EP, participants tend to associate their favourite technique with the perceived most effective technique, contrary to the findings for CR and BT. This is more likely to be due to EP being the technique that many more participants like best (so the chances of a match are higher than for the other techniques) than to there actually being a real match.

With respect to directionality whenever there is disagreement, the results of the Stuart-Maxwell and McNemar-Bowker tests show that the null hypotheses of marginal homogeneity (χ2(2, N = 37) = 8.355, p = 0.015) and symmetry (χ2(3, N = 37) = 8.444, p = 0.038) can be rejected. Looking at the light grey cells in Table 36, we find that there are more participants who perceive CR as the most effective technique but prefer EP than vice versa (8 versus 1). This means that the mismatch between the technique that participants like best and the technique that they perceive as being most effective can largely be put down to participants who like EP best perceiving CR to be more effective.

Table 36 Contingency table for perceived most effective vs. preferred technique (PT1-OT2) and perceived most effective vs. easiest to apply technique (PT1-OT3)

The results for the agreement between the technique that is easiest to apply and the technique that is perceived to be most effective are exactly the same as for the preferred technique (see third row of Table 35). This means that, for EP, the participants equate the technique that they find easiest to apply with the one that they regard as being most effective. This does not hold for the other two techniques. Likewise, the mismatch between the technique that is easiest to apply and the technique perceived as being most effective can largely be put down to participants who find EP easiest to apply perceiving CR to be more effective (see Table 36).

As mentioned earlier, we found that participants have a correct perception of the program in which they detected most defects. Table 37 shows that participants do not associate the simplest program with the program in which most defects were detected (PP1-OP1). This is striking, as it would be logical for it to be easier to find defects in the simplest program. As illustrated by the fact that the null hypotheses of marginal homogeneity (χ2(2, N = 37) = 3.220, p = 0.200) and symmetry (χ2(3, N = 37) = 4.000, p = 0.261) cannot be rejected, we were not able to find bias in any of the cases where there is disagreement. A possible explanation for this result is that participants are not properly interpreting what “simple” means.

Table 37 Agreement between Program perceptions and opinions in the replicated study (PP1-OP1, N = 37)

7.3.2 RQ3.2: Comparing Opinions

Finally, we study the possible relation between the opinions themselves. Looking at Table 38, we find that participants equate the technique they applied best with their favourite technique and with the technique they found easiest to apply (overall and per technique (κ > 0.4, fair to good), and 0 does not belong to 95% CIs, meaning that agreement by chance can be ruled out). It makes sense that the technique that participants found easiest to apply should be the one that they think they applied best and like best. Typically, people like easy things (or maybe we think things are easy because we like them). In this respect, we can conclude that participants’ opinions about the techniques all have the same directional effect.

Table 38 Agreement among technique opinions in the replicated study

7.3.3 Findings

Our findings suggest:

  • Participants’ perceptions of technique effectiveness are related to how well they think they applied the techniques. They tend to think it is they, rather than the techniques, that are the obstacle to achieving more effectiveness (a possible evaluation apprehension threat has materialized).

  • We have not been able to find a relationship between the technique they like best and find easiest to apply, and perceived effectiveness. Note however, that the technique participants think they have applied best is not necessarily the one that they have really best applied.

  • Participants do not associate the simplest program with the program in which they detected most defects. This could be due to participants not properly interpreting the concept “simple”.

  • Opinions are consistent with each other.

Again, these results are confined to the validity limits imposed by the study.

8 Discussion

Next, we summarize the findings of this study and analyse their implications. Note that the results of the study are restricted to junior programmers with little testing experience, and defect detection techniques.

8.1 Answers to Research Questions

  • RQ1.1: What are participants’ perceptions of their testing effectiveness?

    The number of participants perceiving a particular technique/program as being the most effective cannot be considered different across the three techniques/programs.

  • RQ1.2: Do participants’ perceptions predict their testing effectiveness?

    Our data do not support that participants correctly perceive the most effective technique for them. Additionally, no bias has been found towards a given technique. However, they tend to correctly perceive the program in which they detected most defects.

  • RQ1.3: Do participants find a similar amount of defects for all techniques?

    Participants do not obtain similar effectiveness values when applying the different techniques.

  • RQ1.4: What is the cost of any mismatch?

    Mismatch cost is not negligible (mean 31pp), and it is not related to the technique perceived as most effective.

  • RQ1.5: What is expected project loss?

    Expected project loss is 15pp, and it is not related to the technique perceived as most effective.

  • RQ1.6: Are participants’ perceptions related to the number of defects reported by participants?

    The results are not clear on this point. Although our data do not support the conclusion that the technique participants perceive as most effective is the one with which they reported most defects, this relationship should not be ruled out. Further research is needed.

Therefore, the answer to RQ1: Should participants’ perceptions be used as predictors of testing effectiveness? is that participants should not base their decisions on their own perceptions, as they are not reliable and have an associated cost.

  • RQ2.1: What are participants’ opinions about techniques and programs?

    Most people like EP best, followed by both BT and CR (which merit the same opinion). There is no difference in opinion as regards programs.

  • RQ2.2: Do participants’ opinions predict their effectiveness?

    They are not good predictors of technique effectiveness. A bias has been found towards EP.

Therefore, the answer to RQ2: Can participants’ opinions be used as predictors for testing effectiveness? is that participants should not use their opinions, as they are not reliable and are biased.

  • RQ3.1: Is there a relationship between participants’ perceptions and opinions?

    Participants’ perceptions of technique effectiveness are related to how well they think they applied the techniques. We have not been able to find a relationship between the technique they like best and find easiest to apply, and perceived effectiveness. Participants do not associate the simplest program with the program in which they detected most defects.

  • RQ3.2: Is there a relationship between participants’ opinions?

    Yes. Opinions are consistent with each other.

Therefore, the answer to RQ3: Is there a relationship between participants’ perceptions and opinions? is positive for some of them.

8.2 About Perceptions

Participants’ perceptions about the effectiveness of techniques are incorrect (50% get it wrong). However, this is not due to some sort of bias in favour of any of the three techniques under review. These misperceptions should not be overlooked, as they affect software quality. We cannot accurately estimate the cost, as it depends on what faults there are in the software. However, our data suggest a loss of between 25pp and 31pp. Perceptions about programs appear to be correct, although this does not offset the mismatch cost.

Our findings confirm that:

  • Testing technique effectiveness depends on the software faults.

Additionally, they warn developers that:

  • They should not rely on their perceptions when rating a defect detection technique or how well they have tested a program.

Finally, they suggest the need for the following actions:

  • Develop tools to inform developers about how effective the techniques they applied were and how effective their testing was.

  • Develop instruments to give developers access to experimental results.

  • Conduct further empirical studies to learn what technique or combination of techniques should be applied under which circumstances to maximize effectiveness.

8.3 About Opinions

Participants prefer EP to BT and CR (they like it better, think they applied it better and find it easier to apply). Opinions do not predict real effectiveness. This failure to predict reality is partly related to the fact that a lot of people prefer EP but are really more effective using BT or CR. Opinions do not predict real effectiveness with respect to programs either.

These findings warn developers that:

  • They should not be led by their opinions on techniques when rating their effectiveness.

Finally, they suggest the need for the action:

  • Further research should be conducted into what is behind developers’ opinions.

8.4 About Perceptions and Opinions

The technique that participants believe to be the most effective is the one that they think they applied best. However, they are capable of separating their opinions about technique complexity and preferences from their perceptions, as the technique that they think is most effective is not the one that they find easiest to apply or like best.

Our findings challenge that:

  • Perceptions of technique effectiveness are based on participants’ preferences.

They also warn developers that:

  • Maximum effectiveness is not necessarily achieved when a technique is properly applied.

Finally, they suggest the need for the following actions:

  • Determine the best combination of techniques to apply that is at the same time easily applicable and effective.

  • Continue to look for possible drivers to determine what could be causing developers’ misperceptions.

9 Related Work

In recent years, several experiments on defect detection technique effectiveness (static techniques and/or test-case design techniques) have been run with and without humans. Experiments without humans compare the efficiency and effectiveness of specification-based, code-based, and fault-based techniques, as for example the ones conducted by Bieman and Schultz (1992), Hutchins et al. (1994), Offut et al. (1996), Offut and Lee (1994), Weyuker (1984) and Wong and Mathur (1995). Most of the experiments with humans evaluate static techniques, as for example the ones run by Basili et al. (1996), Biffl (2000), Dunsmore et al. (2002), Maldonado et al. (2006), Porter et al. (1995) and Thelin et al. (2004). Experiments evaluating test-case design techniques studied the efficiency and effectiveness of specification-based and control-flow-code-based techniques applied by humans, as the ones run by Basili and Selby (1987), Briand et al. (2004), Kamsties and Lott (1995), Myers (1978) and Roper et al. (1997). These experiments focus on strictly quantitative issues, leaving aside human factors like developers’ perceptions and opinions.

There are surveys that study developers’ perceptions and opinions with respect to different testing issues, like the ones performed by Deak (2012), Dias-Neto et al. (2016), Garousi et al. (2017), Gonçalves et al. (2017), Guaiani and Muccini (2015), Khan et al. (2010) and Marsden and Pérez Rentería y Hernández (2014). However, the results are not linked to quantitative issues. In this regard, some studies link personality traits to preferences according to the role of software testers, as for example Capretz et al. (2015), Kanij et al. (2015) and Kosti et al. (2014). However, there are no studies looking for a relationship between personality traits and quantitative issues like testing effectiveness.

There are some approaches for helping developers to select the best testing techniques to apply under particular circumstances, like the ones made by Cotroneo et al. (2013), Dias-Neto and Travassos (2014) or Vegas et al. (2009). Our study suggests that this type of research needs to be more widely disseminated to improve knowledge about techniques.

Finally, there are several ways in which developers can make decisions in the software development industry. The most basic approach is the classical reliance on perceptions and/or opinions, as reported in Dybå et al. (2005) and Zelkowitz et al. (2003). Other approaches suggest using classical decision-making models (Aurum and Wohlin 2002). Experiments can also be used for industry decision-making, as described by Jedlitschka et al. (2014). Devanbu et al. (2016) have observed the use of past experience (beliefs). More recent approaches advocate automatic decision-making based on mining software repositories (Bhattacharya 2012).

10 Conclusions

The goal of this paper was to discover whether developers’ perceptions of the effectiveness of different code evaluation techniques are right in the absence of prior experience. To do this, we conducted an empirical study with students plus a replication. The original study revealed that participants’ perceptions are wrong. As a result, we conducted a replication aimed at discovering what was behind participants’ misperceptions. We opted to study participants’ opinions on techniques. The results of the replicated study corroborate the findings of the original study. They also reveal that participants’ perceptions of technique effectiveness are based on how well they applied the techniques. We also found that participants’ perceptions are not influenced by their opinions about technique complexity and preferences for techniques.

Based on these results, we derived some recommendations for developers: they should not trust their perceptions, and they should be aware that correct technique application does not guarantee that they will find all the defects in a program.

Additionally, we identified a number of lines of action that could help to mitigate the problem of misperception, such as developing tools to inform developers about how effective their testing is, conducting more empirical studies to discover technique applicability conditions, developing instruments to allow easy access to experimental results, investigating other possible drivers of misperceptions or investigating what is behind opinions.

Future work includes running new replications of these studies to better understand their results.