1 Introduction

The opinions and perspectives of software developers matter significantly in software engineering research. Research in the domain relies on the experience and knowledge of software practitioners to validate assumptions and to evaluate the tools, techniques, and methods addressing software engineering problems. Numerous studies reveal the importance of practitioners’ perspectives (Eck et al. 2019; Ribeiro et al. 2016; Pina et al. 2022; Al-Subaihin et al. 2021). However, empirical evidence may not always agree with developers’ perspectives, opinions, or beliefs. For example, Devanbu et al. (2016) show that software developers hold very strong beliefs on certain topics, but these beliefs are often based on personal experience; such beliefs and the corresponding empirical evidence may be inconsistent. Similarly, despite developers’ fairly common negative perspective on code clones, Rahman et al. (2010) did not find sufficient empirical evidence of a strong correlation between code clones and bug proneness; code clones could, however, have effects other than bugs. Along similar lines, a study by Murphy-Hill et al. (2012) casts doubt on practitioners’ and researchers’ assumptions related to refactoring. Finally, Janes and Succi (2012) challenge the practitioners’ perspective on agile processes by bringing out the dark side of agility. In this paper, we present a case study of testability smells, i.e., programming practices that may negatively affect the testability of a software system, to investigate whether software developers’ perspectives about testability smells are backed up by empirical evidence.

Researchers and practitioners have proposed various, yet inconsistent, definitions of software testability (Garousi et al. 2019). The most common definition refers to the degree to which software design choices facilitate the development of test cases (Chowdhary 2009; Zhao 2006; Binder 1994). Specifically, several researchers (Chowdhary 2009; Zhao 2006; Payne et al. 1997) define testability as the ease of testing. In addition, some researchers (Chowdhary 2009; Binder 1994; Zilberfeld 2012) emphasize that testability is not a binary concept but must be expressed as a degree or extent. Other researchers (Chowdhary 2009; Zilberfeld 2012; Pettichord 2002) explicitly connect software design choices with the definition of software testability. Furthermore, some studies (Chowdhary 2009; Payne et al. 1997; Zilberfeld 2012) identify the degree of effectiveness with which test development is facilitated as another characteristic of the testability definition.

There has been a significant amount of work on test smells and their effects (Junior et al. 2020; Spadini et al. 2018; Bavota et al. 2012). Test smells are bad programming practices in test code that negatively affect the quality of test suites and production code (Garousi and Küçük 2018). Though they may look similar, practices that affect testability are quite different from test smells. Test smells occur in test code, while issues affecting testability arise in production code. Also, test smells indicate poor quality of test cases, whereas poor testability impairs the ability to write tests in the first place.

Several researchers have proposed frameworks for measuring and empirically evaluating software testability by considering (1) software design choices, including programming language features (Voas 1996; Singh et al. 2015), (2) software quality metrics, such as cohesion, depth of inheritance tree, coupling, and propagation rate of methods (Lo and Shi 1998), the number of method calls, dependencies, and attributes of a class (Mouchawrab et al. 2005), and other software metrics (Khan and Mustafa 2009; Singh et al. 2015; Bruntink and van Deursen 2006), as well as (3) testing effort (Terragni et al. 2020). However, various aspects of software testability remain unexplored, including the extent to which specific programming practices (e.g., handling of dependencies, coding style, and use of access modifiers) impact software testability.

The goal of this study is to find experimental evidence to validate current practitioners’ perspectives about testability smells. To achieve this goal, we first curate a catalog of four programming practices that can affect software testability, referred to as testability smells. We then gather software developers’ perspectives through a questionnaire survey on testability in general, and on our proposed testability smells in particular. Finally, we conduct a large-scale empirical study guided by three research questions to explore the effect of testability smells on test cases, their quality, and reported bugs at different granularity levels. To support the detection of testability and test smells, we extend our tool DesigniteJava. To answer the research questions, we curated a dataset of \(1,115\) Java software projects publicly available on GitHub and analyzed them using DesigniteJava.

Our survey shows that software developers consider poor testability a factor that impedes software testing, and they overwhelmingly acknowledged the proposed testability smells. Our results suggest that testability smells show a low positive correlation with test smells at the repository granularity; however, at the class level, testability smells and test smells do not correlate. Our exploration of the relationship between testability smells and test density reveals no correlation at either the repository or the class granularity. Finally, our observations from our experiments indicate that testability smells do not contribute to bugs. Therefore, developers’ opinions and perspectives might need to be complemented with empirical evidence before being brought into practice.

This study makes the following contributions to the field.

  1. We investigate the extent to which developers’ perspectives are in line with the empirical evidence we found in the context of testability smells.

  2. We consolidate a set of programming practices that affect testability of a software system in the form of a catalog of testability smells. This catalog provides a vocabulary for researchers and practitioners to discuss specific programming practices potentially impacting the testability of software systems.

  3. We extend DesigniteJava to detect the proposed testability smells and eight test smells. The tool facilitates further research on the topic of testability smells. Also, interested software developers may use this tool to detect testability smells in their source code to better understand the impact of design choices on testing.

  4. We explore the relationships between testability smells and several aspects relevant to test development and bugs. Such an exploration improves our understanding, both as software developers and researchers, of testability and lays the groundwork for devising new tools and techniques to improve test development.

We have made DesigniteJava, which identifies testability smells, publicly available along with a replication package at https://github.com/SMART-Dal/testability. We hope this facilitates other researchers in replicating, reproducing, and extending the presented study.

The rest of this paper is organized as follows. First, we present related work in Section 2. Section 3 provides an overview of the methods. Section 4 presents the initial catalog of testability smells, our questionnaire survey targeting software practitioners along with the obtained results, and the tool implementation to detect testability and test smells. We present the mechanism followed to select and download repositories from GitHub in Section 5. We discuss the results in Section 6 and their implications in Section 7. Threats to validity are discussed in Section 8. Finally, we conclude in Section 9.

2 Related work

Software testability

From existing studies, we found that testability was initially considered for hardware design (Le Traon and Robach 1995; Vranken et al. 1996; Le Traon and Robach 1997). The concepts of hardware testability were then applied to software testability (Nguyen et al. 2002; Kolb and Muthig 2006). Afterwards, a large number of studies have been conducted exploring various aspects of software testability. To measure testability for data-flow software, Nguyen et al. (2005) suggested an approach that uses the satan method, which transforms the source code into a static single assignment form. The form is then fed into a testability model to detect source code parts with testability weaknesses. Bruntink and van Deursen (2006) collected a large number of source code metrics (e.g., depth of inheritance tree, fan out, and lack of cohesion of methods) and test code metrics to explore their relationship with testability. The analysis focused on open-source Java applications. The results suggest that there is a significant correlation between class-level metrics (most notably fan out, loc per class, and response for class) and test-level metrics (loc per class and the number of test cases). Vincent et al. (2002) investigated the testability of software components written in C++ and Java on workstations and embedded systems. Moreover, the authors suggested an approach named built-in test for run-time testability, which can provide more testable, reliable, and maintainable software components. Filho et al. (2020) used ten testability attributes, proposed in previous studies, to examine their correlation with source code metrics and test specification metrics (e.g., number of test cases, test coverage) on two Android applications. They found that testability attributes are correlated with several source code metrics and test specification metrics. Chowdhary (2009) presented experiences from applying testability concepts and introduced guidelines to ensure that testability is taken into consideration during software planning and testing. Based on these findings, the author introduced a testability framework called shock. Furthermore, various resources (Human 2022; Zilberfeld 2012) discuss their interpretation of testability and the impact of smells affecting it.

Assessing testability

Voas (Jeffrey 1991) surveyed the factors that affect software testability, arguing that a piece of software that is likely to reveal faults within itself during testing is said to have high testability. According to this work, information loss is a phenomenon that occurs during program execution and increases the likelihood that a fault will remain undetected. Finally, Voas (1996) compared the testability of object-oriented and procedural systems and examined whether testability is affected by programming language characteristics.

Surveys on testability

Various literature surveys on testability have been carried out. Freedman (1991) investigated the testability of software components and argued that the domain testability of software is defined by applying the concepts of observability and controllability to software. Garousi et al. (2019) examined 208 papers on testability (published between 1982 and 2017) and also found that the two most commonly mentioned factors affecting testability are observability and controllability. Furthermore, their survey argues that common ways to improve testability are testability transformation, improving observability, adding assertions, and improving controllability. Similarly, Hassan (2015) conducted a systematic literature review on software testability to investigate to what extent it affects software robustness. The results show that a variety of testability issues are indeed relevant, with observability and controllability issues being the most researched. They also found that fault tolerance, exception handling, and handling external influence are prominent robustness issues.

Test smells

A wide variety of studies explored test smells and their effects on, and relationships with, various aspects of software development, including change and bug proneness (Spadini et al. 2018), maintainability (Bavota et al. 2015; Kim et al. 2021; Tufano et al. 2016), and test flakiness (Fatima et al. 2022). Specifically, Spadini et al. (2018) investigated the relationship between the presence of test smells and the change- and defect-proneness of test code, as well as the defect-proneness of the tested production code. Among their findings, they observed that tests with smells are indeed more change- and defect-prone. Regarding maintainability, Bavota et al. (2015) presented empirical studies on test smells and showed that test smells have a strong negative impact on program comprehension and maintenance. They also found that comprehension is 30% better in the absence of test smells. Furthermore, Kim et al. (2021) conducted an empirical study on the evolution and maintenance of test smells. They found that the number of test smells increases as a system evolves, and through a qualitative analysis they revealed that most test smell removal occurs as part of maintenance activities. Additionally, Tufano et al. (2016) showed that test smells are usually introduced when the corresponding test code is committed to the repository for the first time. Those test smells then tend to remain in a system for a long time, hindering software maintenance. Finally, Fatima et al. (2022) developed Flakify, a black-box, language model-based predictor for flaky test cases.

Despite extensive work on testability, the existing literature does not translate high-level principles such as observability and controllability into actionable programming practices. As a result, although these high-level testability principles have been known to the community for a long time, there has been no tool support for detecting concrete violations of them. We provide a tool that supports the detection of testability smells. Furthermore, we explore the relationship between testability practices and test quality and size, which is another contribution of this work.

3 Overview of the methods

In the pursuit of weighing developers’ perceptions against empirical evidence in the context of testability smells, we formulate the following research questions.

RQ1.:

To what extent do testability smells and test smells correlate? Testability smells refer to bad programming practices that are believed to make test case design and development difficult. Developers may choose to follow non-optimal practices when it is not easy to write tests, leading to poor-quality test cases. Test smells refer to bad programming practices in unit test code that compromise test code quality by violating the best practices recommended for writing test code (Deursen et al. 2001). This research question explores whether and to what extent testability smells and test smells correlate.

RQ2.:

Do testability smells correlate with test suite size? By definition, testability smells make the design and development of test cases difficult. With this research question, we aim to empirically evaluate whether the presence of testability smells can hinder test development and consequently lead to fewer test cases. By answering this research question, we can inform developers about testability smells that might impede smooth test development.

RQ3.:

Do testability smells cause more bugs? Testability smells make it harder for developers to test a software system. This implies that the software under test lacks appropriate testing, leaving more bugs uncovered during software development. This research question aims to investigate whether and to what extent testability smells can lead to a higher number of reported bugs.

Fig. 1: Overview of the study

Towards the goal of the study, as outlined in Fig. 1, we first prepared a set of potential testability smells based on the available literature and recommended practices. We then carried out an online survey to understand developers’ perspectives on software testability and to gauge the extent to which they agree that those smells really impact testability negatively. We extended our tool (DesigniteJava) to detect testability and test smells. We analyzed \(1,115\) Java repositories downloaded from GitHub. After identifying the smells, we reported our observations and findings with respect to each research question.

4 Testability smells

We define testability smells as the programming practices that reduce the testability of a software system. This section presents an initial catalog of testability smells, validates them by carrying out an online developers’ survey, and discusses the implementation details of our tool.

4.1 Initial catalog of testability smells

In order to identify specific programming practices that negatively affect testability, we carried out a lightweight multivocal literature review (mlr), which surveys writings, views, and opinions in diverse forms and formats (Garousi et al. 2018). The review process has three main stages: search, selection, and information extraction.

In the search stage, two of the co-authors searched for a set of search terms (including testability, ease of testing, and software design+test) on Google Scholar and Google Search. For each search term, we manually examined a minimum of seven pages of search results. After this minimum threshold, we continued the search until we encountered two consecutive result pages without any new and relevant articles. Adopting this mechanism reduced the chance of missing relevant articles in the context of our study.

We applied inclusion and exclusion criteria to filter out irrelevant sources. The main inclusion criterion was that the source’s content must relate to testability. Examples of exclusion criteria include dropping gray literature that is too short, written in a language other than English, or lacking objectivity in its presentation. These examples map to the Objectivity requirement of the mlr process guidelines (Garousi et al. 2018).

In the last stage, we read or viewed the selected resources and extracted information relevant to our study. Specifically, we looked for concrete recommendations from the selected sources in terms of programming practices that influence testability. We grouped the practices based on similarity and assigned each group an appropriate name reflecting its rationale. We identified four potential testability smells discussed by more than one selected source. We present the consolidated set of smells below. To the best of our knowledge, this is the first attempt to document specific programming practices as testability smells. It is by no means a comprehensive list of testability smells; we encourage the research community to further extend this initial catalog.

4.1.1 Hard-wired dependencies

This smell occurs when a concrete class is instantiated and used in a class, resulting in a hard-wired dependency (Chowdhary 2009; Zilberfeld 2012; Hevery 2008). A hard-wired dependency creates tight coupling between concrete classes and reduces the ease of writing tests for the class (Chowdhary 2009). Such a dependency makes the class difficult to test because the newly instantiated objects are not replaceable with test doubles (such as stubs and mocks). Hence, the test will check not only the cut (class under test) but also its dependencies, which is undesirable.

In Listing 1, the parse method (Footnote 1) creates an object of the BindingOperation class (line 4) and calls a few of its methods (lines 6 and 7). The object cannot be replaced at test execution time due to the concrete object creation and its use within this method. Hence, the hard-wired dependency reduces the ease of writing tests for the class.

Listing 1: The parse method instantiating and using a BindingOperation object (hard-wired dependency)
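Since the original listing is not reproduced here, the snippet below is a condensed, illustrative reconstruction of the pattern described above rather than the verbatim wsc source; the method names load() and validate() are hypothetical stand-ins.

// Illustrative sketch of a hard-wired dependency (not the verbatim wsc code).
class BindingOperation {                 // stub standing in for the real wsc class
    void load(String wsdl) { /* ... */ }
    void validate() { /* ... */ }
}

class Binding {
    void parse(String wsdl) {
        // Concrete instantiation: the dependency cannot be replaced by a stub or mock.
        BindingOperation operation = new BindingOperation();
        operation.load(wsdl);            // methods called on the hard-wired dependency
        operation.validate();
    }
}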

4.1.2 Global state

Global variables are, in general, widely discouraged (Marshall and Webber 2000; Sward and Chamillard 2004). This smell arises when a global variable or a Singleton object is used (Hevery 2008; Suryanarayana et al. 2014; Feathers 2004; Thomas and Hunt 2019). Global variables create hidden channels of communication among abstractions in the system even when those abstractions do not depend on each other explicitly. They introduce unpredictability and hence make tests difficult for developers to write.

The Builder class (Footnote 2) in Listing 2 is accessible, and hence can be read and written, from anywhere in the project. Such a practice makes it difficult to predict the state of the object in tests.

Listing 2: The Builder class exposing global state
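The sketch below illustrates the kind of construct described above; it is not the original Listing 2, and the field and method names are hypothetical.

// Illustrative sketch of the global state smell (not the original Listing 2).
class Builder {
    // Publicly accessible, mutable, static state: any code in the project can
    // read or modify it, so a test cannot easily control or predict its value.
    public static Builder INSTANCE = new Builder();

    private final StringBuilder buffer = new StringBuilder();

    void append(String part) { buffer.append(part); }
    String result() { return buffer.toString(); }
}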

4.1.3 Excessive dependency

This smell occurs when the class under test has excessive outgoing dependencies. Dependencies make testing harder; a large number of dependencies makes it difficult to write tests for the class under test in isolation (Suryanarayana et al. 2014; Lo and Shi 1998; Zhou et al. 2012). For example, the Error class (Footnote 3) in the open-source project wsc refers to nine other classes within the project — Bind, BulkConnection, TypeInfo, StatusCode, XmlOutputStream, XmlInputStream, ConnectionException, TypeMapper, and Verbose. Such a high number of dependencies on other classes increases the effort required to test this class in isolation.
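The condensed sketch below illustrates the shape of such a class; it is not the actual wsc source, and only three of the nine dependencies mentioned above are shown as empty stubs.

// Condensed, illustrative sketch of the excessive dependency smell (not the actual wsc code).
class TypeInfo { }
class StatusCode { }
class TypeMapper { }

class Error {
    // Each field adds an outgoing dependency that a test either has to set up or stub out.
    private final TypeInfo typeInfo = new TypeInfo();
    private final StatusCode statusCode = new StatusCode();
    private final TypeMapper mapper = new TypeMapper();
    // ... Bind, BulkConnection, XmlOutputStream, XmlInputStream,
    // ConnectionException, and Verbose omitted for brevity
}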

4.1.4 Law of Demeter violation

This smell arises when the class under test violates the law of Demeter, i.e., the class interacts with objects that are neither class members nor method parameters (Lieberherr 1989; Kaczanowski 2013; Thomas and Hunt 2019). In other words, the class contains a chain of method calls such as x.getY().doThat(). Violations of the law of Demeter create additional dependencies that a test has to take care of. For example, lines 4 and 5 of the snippet (Footnote 4) given in Listing 3 call a method to obtain an object and then call another method on the obtained object. Such method chains introduce indirect dependencies that reduce the ease of writing tests for the class.

Listing 3: A method chain violating the law of Demeter
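The sketch below shows the shape of such a violation; it is not the original Listing 3, and the Session, Connection, and Notifier types are hypothetical.

// Illustrative sketch of a law of Demeter violation (not the original Listing 3).
class Connection {
    void send(String message) { /* ... */ }
}

class Session {
    private final Connection connection = new Connection();
    Connection getConnection() { return connection; }
}

class Notifier {
    void notifyUser(Session session, String message) {
        // Method chain: the test must satisfy both the Session and the Connection
        // it returns, even though only Session is a direct collaborator.
        session.getConnection().send(message);
    }
}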

4.2 Developer survey

We carried out an anonymous online questionnaire survey targeting software developers to understand their perspectives on software testability. Specifically, we aimed to consolidate developers’ perspectives w.r.t. the definition of testability as well as the relevance of our identified testability smells. We divided our survey into three sections. In the first section, we collected information about developers’ experience. In the second section, we asked developers how they define testability. The final section presented our initial catalog of testability smells and asked the respondents whether and to what extent they agree that the presented practices negatively affect testability. All the questions in this section were Likert-scale questions. The questionnaire that we used is available online (Sharma et al. 2022).

Before rolling out the survey to a larger audience, we ran a pilot of the survey, collected feedback, and improved the questionnaire. We shared the survey on online professional social media channels (such as Twitter, LinkedIn, and Reddit) and sought participation from the software development community. We kept the survey open for six weeks and received 45 complete responses.

4.2.1 Findings from the survey

Figure 2 presents the demographic distribution of participants in terms of years of experience classified by their roles. We asked them to check all applicable roles and hence the total number of responses shown in the figure is more than the number of participants. It is evident that most of the participants belong to the “software developer” role; a significant number of the participants belong to the highly experienced group (11-20 years).

Fig. 2: Demographics (role and experience in number of years) of participants

Fig. 3: Respondents’ perspective of considered testability smells

Definition of software testability

We asked the participants a question to elicit the definition of software testability. Most of the responses point to the degree of ease with which automated tests can be written. Some of the actual responses are: “The extent to which a software component can be tested”, “software testability is the degree that software artifacts support testing”, “easy testing”, and “a measure of how easy it is for the code to be tested through automated tests”. An interested reader may look at the raw anonymized responses in our replication package (Sharma et al. 2022).

Programming practices affecting testability

The next set of questions presented the four programming practices corresponding to the potential testability smells and asked the respondents whether and to what extent these practices negatively affect software testability. We also asked about the rationale for their choice. Figure 3 presents the consolidated responses for all four considered smells. A very large percentage of the responses (\(84\%\), \(87\%\), \(78\%\), and \(73\%\), respectively, for the four considered smells) agreed (either completely or somewhat) that the presented practices are testability smells.

We looked into the rationale provided by respondents for the other options (i.e., neither agree nor disagree, somewhat disagree, and completely disagree). Specifically, for hard-wired dependency, one of the respondents who marked completely disagree did not offer any justification; another respondent suggested using mocking. One respondent who chose somewhat disagree for the same smell based his/her answer on the assumption that dependencies are trivial (i.e., an internal class or a trivial class from a library) most of the time. Respondents who opted for “neither agree nor disagree” either expressed their unfamiliarity with the specific question or left the rationale question unanswered.

The respondents who chose the “somewhat disagree” option for the global state smell justified their selection by providing a workaround to test a unit with global variables; for example, one of the respondents provided the following rationale: “Logging where and when the global state is altered is usually good enough for testing the code I work with”.

Those who chose the “completely disagree” option for the excessive dependency smell seem, however, to agree that it is difficult to test source code containing this smell based on their open answers (for example, one such answer states “If designed properly then testing won’t be difficult but yes more dependencies need extra setups”).

For the law of Demeter violation smell, a considerable number of respondents chose the “neither agree nor disagree” option; however, they did not provide any informative rationale for this testability smell.

Additional programming practices affecting testability

We also enquired about other programming practices that may negatively influence the testability of a software system. The responses provided us with additional practices that hinder testability, such as poor separation of concerns (mixing ui and non-ui aspects), interaction with external resources (such as sockets, files, and databases), time dependencies, asynchronous operations, reflection, invoking the command line from code, methods that do not return anything but change internal state, and requirements for authentication credentials. In addition, the respondents mentioned spaghetti code, highly tangled code, large methods, and non-standard environments as practices that reduce testability. Some of the indicated practices are covered by the proposed smells. For example, interaction with external resources is captured by the hard-wired dependency smell, since external resources such as a network connection need to be instantiated.

The results from the survey not only suggest that the investigated smells are indeed considered practices that affect testability negatively but also provide indicators for the community to extend the proposed catalog.

4.3 The DesigniteJava tool

We extended our tool DesigniteJava (Sharma 2018) to support testability and test smell detection (Footnote 5). We selected DesigniteJava because the tool already detects a variety of code smells and has been used in various studies (Oizumi et al. 2019; Sharma et al. 2020; Eposhi et al. 2019; Uchôa et al. 2020; Alenezi and Zarour 2018). Architecturally, DesigniteJava is structured in three layers, as shown in Fig. 4. The Eclipse Java Development Toolkit (jdt) forms the bottom layer. DesigniteJava utilizes jdt to parse the source code, prepare asts, and resolve symbols. The source model is the middle layer; it invokes jdt and maintains a source code model from the information extracted from the asts. The top layer of the tool contains the business logic, i.e., the smell detection and code quality metrics computation logic. This layer accesses the source model, identifies smells, computes metrics, and outputs the generated information in either .csv or .xml files. Due to the existing support for detecting smells and computing metrics, various features (such as the source model) could be reused in our context. To support testability and test smell detection, we added code to the smell detection layer. We also modified the source model layer to extract the additional information required for our purpose. The extended version of the tool can be downloaded from its website (Footnote 6).

Existing studies have proposed a few tools to detect test smells. We first tried to utilize existing tools, specifically JNose (Virgínio et al. 2020) and TsDetect (Peruma et al. 2020). We were able to use JNose after taking help from its authors and developing a wrapper to use the tool as a console application. However, a quick analysis of the produced results showed a considerable number of false positive and false negative smell instances. Similarly, we were unable to use TsDetect because it is not suitable for analyzing a large number of repositories due to a manual step requiring a mapping between test files and the corresponding production files. Hence, we decided to develop our own test smell detector to identify the following eight test smells—Assertion roulette, Conditional test logic, Constructor initialization, Eager test, Empty test, Exception handling, Ignored test, and Unknown test. We selected these smells because they are commonly known test smells and both tools, i.e., TsDetect and JNose, support them. We implemented the support to detect test smells in DesigniteJava along with the testability smells.

Fig. 4: Architecture of DesigniteJava tool

4.3.1 Detection rules for testability smells

We summarize below the detection strategies used for the testability smells.

Hard-wired dependency: We first detect all the objects created using the new operator in a class. Then, if the functionality of the newly created object is used (i.e., at least one method is called) in the same class, we detect this smell.

Global state: If a class or a field in a class is declared with public static modifiers, we detect this smell.

Excessive dependency: We compute fan-out (i.e., total number of outgoing dependencies) of a class. If the fan-out of the class is more than a pre-defined threshold, we detect the smell. The literature (Oliveira et al. 2014, 2014b; Ömer Faruk and Yan 2021) suggests a threshold value for fan-out between 5 and 15 with a varying compliance rate. We adopted 7 as the threshold value as suggested by Ömer Faruk and Yan (2021). We ensure that the threshold value is configurable; hence, future studies may change any of the thresholds used.

Law of Demeter violation: We detect all the method invocation chains of the form aField.getObject().aMethod(). We detect this smell when method calls are made on objects that are not directly associated with the current class.
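As an illustration, the sketch below shows how two of these rules (global state and excessive dependency) could be expressed over a simplified source model; it is not DesigniteJava's actual jdt-based implementation, and the TypeModel abstraction is hypothetical.

// Simplified sketch of two detection rules over a toy source model
// (not DesigniteJava's actual implementation).
import java.util.Set;

class TypeModel {
    final String name;
    final boolean hasPublicStaticMember;     // any field (or the type itself) declared public static
    final Set<String> outgoingDependencies;  // distinct project types referenced by this class

    TypeModel(String name, boolean hasPublicStaticMember, Set<String> outgoingDependencies) {
        this.name = name;
        this.hasPublicStaticMember = hasPublicStaticMember;
        this.outgoingDependencies = outgoingDependencies;
    }
}

class TestabilitySmellRules {
    static final int FAN_OUT_THRESHOLD = 7;  // configurable threshold used in this study

    static boolean hasGlobalState(TypeModel type) {
        return type.hasPublicStaticMember;
    }

    static boolean hasExcessiveDependency(TypeModel type) {
        return type.outgoingDependencies.size() > FAN_OUT_THRESHOLD;
    }
}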

4.3.2 Detection rules for test smells

The tool uses the definition of test smells and their detection strategies from existing studies (Virgínio et al. 2020; Peruma et al. 2020; Aljedaani et al. 2021). We present a summary of the detection strategies for the considered test smells below.

Assertion roulette: We detect this smell when a test method contains more than one assertion statement without giving an explanation as a parameter in the assertion method.

Conditional test logic: We detect this smell when there is an assertion statement within a control statement block (e.g., if condition).

Constructor initialization: We detect this smell when a constructor of a test class initializes at least one instance variable.

Eager test: We detect this smell when a test method calls multiple production methods.

Empty test: We detect this smell when a test method does not contain any executable statement within its body.

Exception handling: We detect this smell when a test method asserts within a catch block or throws an exception, instead of using a dedicated construct such as assertThrows().

Ignored test: We detect this smell when a test method is ignored using the Ignore annotation.

Unknown test: We detect this smell when a test method does not contain any assert call or expected exception.
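For illustration, the JUnit 4 snippet below (written for this paper, not taken from the analyzed repositories) contains methods that the assertion roulette and conditional test logic rules above would flag.

// Illustrative JUnit 4 tests exhibiting two of the smells listed above.
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.assertTrue;
import org.junit.Test;

public class ExampleSmellyTest {

    @Test
    public void flaggedAsAssertionRoulette() {
        // Multiple assertions, none with an explanation message.
        assertEquals(4, 2 + 2);
        assertTrue(2 * 2 > 3);
    }

    @Test
    public void flaggedAsConditionalTestLogic() {
        int value = Integer.parseInt("42");
        if (value > 0) {               // assertion inside a control statement block
            assertTrue(value % 2 == 0);
        }
    }
}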

4.3.3 Validation

We curated a ground truth of smells in a Java project to manually validate the tool, as explained below.

Subject system selection: We used the RepoReapers dataset (Munaiah et al. 2017) and applied the following criteria to select a subject system.

  1. The repository must be implemented mainly in the Java programming language.

  2. The repository must be of moderate size (between 10K and 15K) to avoid toy projects on one side and excessive manual effort on the other.

  3. The repository must have a unit-test ratio \(>0.0\) (number of sloc in the test files to the number of sloc in all source files).

  4. The repository must have a documentation ratio \(>0.0\) (number of comment lines of code to the number of non-blank lines of source code).

  5. The repository must have a community size \(>1\) (more than one developer).

We applied the criteria and sorted the list by the number of stars. We obtained j256/ormlite-jdbc, paul-hammant/paranamer, and forcedotcom/wsc as the top three projects satisfying our criteria. The majority of the source code in j256/ormlite-jdbc and paul-hammant/paranamer was test code. Hence, we selected j256/ormlite-jdbc (Footnote 7) as our subject system for test smell validation. However, such repositories were not suitable for validating testability smells, since we detect testability smells in non-test code. Hence, we selected forcedotcom/wsc (Footnote 8), a project that offers a high-performance web service stack for clients, as our subject system for the manual validation of testability smells.

Validation protocol: Two evaluators manually examined the source code of the selected subject systems and documented the testability and test smells that they found. Both evaluators hold a PhD degree in computer science and have more than 5 years of software development experience. Before carrying out the evaluation, they were introduced to testability and test smells. They were allowed to use ide features (such as “find”, “find usage” (of a variable), and “find definition” (of a class)) and external tools to collect code quality metrics to help them narrow their search space. Both evaluators carried out their analyses independently. It took approximately three full work days to complete the manual analysis. After their manual analysis was complete, they compared their findings to spot any differences. We used Cohen’s Kappa (Berry et al. 1988) to measure the inter-rater agreement between the evaluators. The obtained results, \(89\%\) and \(93\%\) for testability and test smells respectively, show a strong agreement between the evaluators. The evaluators discussed the remaining findings and resolved the conflicts.
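For reference, Cohen’s Kappa follows its standard definition, computed from the observed agreement \(p_o\) between the two evaluators and the agreement \(p_e\) expected by chance:

$$\begin{aligned} \kappa = \frac{p_o - p_e}{1 - p_e} \end{aligned}$$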

Validation results: We ran our tool, DesigniteJava, on the subject systems and identified testability and test smells. We manually matched the ground truth prepared by the evaluators against the results produced by the tool, classifying each smell instance as true positive (tp), false positive (fp), or false negative (fn). We computed precision and recall metrics using the collected data.
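Specifically, precision and recall follow their standard definitions over the classified instances:

$$\begin{aligned} \text {precision} = \frac{tp}{tp + fp}, \qquad \text {recall} = \frac{tp}{tp + fn} \end{aligned}$$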

Table 1 Results of manual validation for testability smells
Table 2 Results of manual validation for test smells

Table 1 presents the results of the manual evaluation for testability smells. The tool identified 161 instances of testability smells out of a total of 172 manually verified smell instances. The tool produced two false positive instances and eleven false negative instances. The false positive instances arose mainly because the tool identified a hard-wired dependency even when an object was instantiated within a method call statement. The false negatives were due to an improper resolution of enumeration types; we traced this inconsistent behavior back to the jdt parser library. The precision and recall of the tool for testability smells based on this analysis are \(161/(161+2) = {\textbf {0.99}}\) and \(161/(161+11)={\textbf {0.94}}\), respectively. Similarly, Table 2 shows the results of the manual evaluation carried out for test smells. Out of 301 test smells in 428 test methods, the tool correctly detected 298 smell instances. The three false negative instances were also traced back to inconsistent behavior of the parser library. The precision and recall of the tool for test smells based on this analysis are \(298/(298+0) = {\textbf {1.0}}\) and \(298/(298+3)={\textbf {0.99}}\), respectively. An interested reader may find the detailed manual analysis in our replication package (Sharma et al. 2022).

Generalizability of conclusions: The above validation shows that the tool produces reliable results in almost all cases. Given that the tool has been used by many researchers and practitioners, occasional issues reported by them were promptly fixed, further improving the reliability of the tool. A few known issues and limitations of the tool remain. First, due to a symbol resolution issue in jdt, in some rare cases the tool cannot resolve a symbol, which leads to issues such as the inability to identify the type of a variable. Also, the tool can identify test smells only when the tests are written using the JUnit framework.

5 Mining GitHub repositories

We use the following mechanism to select and download repositories from GitHub.

  1. We use RepoReapers (Munaiah et al. 2017) to filter out low-quality and too-small repositories on GitHub. We use the quality characteristics provided by RepoReapers to define suitable criteria for repository selection. RepoReapers assesses repositories on eight characteristics and assigns each a score, typically between 0 and 1. We select all Java repositories in the RepoReapers dataset where the architecture (as evidence of code organization), community and documentation (as evidence of collaboration), unit tests (as evidence of quality), and history and issues (as evidence of accountability) scores are greater than zero. Further, we filter out repositories containing less than 1,000 lines of code (loc) and having less than 10 stars.

  2. We obtain 1,500 repositories after applying the above selection criteria.

  3. We analyze all the selected repositories using the DesigniteJava tool that we extended to identify testability and test smells.

Table 3 presents the characteristics of the analyzed repositories. We attempted to download and analyze all the selected repositories; however, we could not download some of them (because they were either deleted or made private) and could not analyze others (due to missing tests developed using the JUnit framework). Specifically, we did not find JUnit tests in 300 repositories. We successfully analyzed \(1,115\) repositories containing approximately 46 million loc. Our replication package (Sharma et al. 2022) includes the initial set of repositories and the names of all the successfully analyzed repositories, along with the raw data generated by the employed tool, DesigniteJava.

Table 3 Characteristics of the analyzed repositories

6 Results

6.1 RQ1. To what extent do testability smells and test smells correlate?

6.1.1 Approach

The goal of this RQ is to explore the degree of correlation between test smells and testability smells in a repository. To achieve this goal, we first detect all the considered testability and test smells in the selected repositories using DesigniteJava. We calculate the sum of all testability smells and test smells per repository. Then, we compute smell densities (Sharma et al. 2020) to normalize the total number of smells and eliminate the potential confounding factor of project size. Testability smell density is defined as the total number of testability smells per one thousand lines of code (i.e., (number of testability smells \(\times \) 1,000)/total lines of code). Test smell density is defined as the average number of test smells per test method (i.e., number of test smells/total number of tests). We use Spearman’s correlation coefficient (Spearman 1961) to measure the degree of association between these two smell types.
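The sketch below shows this repository-level computation, assuming Apache Commons Math (commons-math3) for Spearman’s coefficient; the per-repository counts are hypothetical placeholders for the values obtained from DesigniteJava’s output.

// Minimal sketch of the repository-level density and correlation computation.
import org.apache.commons.math3.stat.correlation.SpearmansCorrelation;

class SmellDensityAnalysis {

    static double testabilitySmellDensity(int testabilitySmells, int totalLoc) {
        return (testabilitySmells * 1000.0) / totalLoc;   // smells per 1,000 lines of code
    }

    static double testSmellDensity(int testSmells, int totalTests) {
        return (double) testSmells / totalTests;          // smells per test method
    }

    public static void main(String[] args) {
        // Hypothetical densities, one entry per repository.
        double[] testabilityDensities = {1.2, 0.4, 3.1, 2.2, 0.9};
        double[] testSmellDensities   = {0.8, 0.3, 1.5, 1.1, 0.6};
        double rho = new SpearmansCorrelation().correlation(testabilityDensities, testSmellDensities);
        System.out.println("Spearman's rho = " + rho);
    }
}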

Furthermore, we explore the relationship at the class level. With this fine-grained analysis, we aim to see whether a class C that suffers from testability smells shows a high number of test smells in the test cases that primarily test C, and vice versa. Testability smells occur in production (non-test) code and test smells arise in test code. Hence, we require a mechanism to map a production class to the test classes that test it. We implemented the logic of identifying the production class under test for each test case in DesigniteJava. For the analysis, we first find all the method calls in each test case. Then, we identify the classes of the methods that are called from the test case. It is possible that a test case calls methods belonging to multiple classes; in that case, we attempt to identify the primary class that is being tested by the test case. To do so, we match the name of the test class against the names of candidate primary classes; typically, a test class is named by appending Test to the name of the class it tests. If the test class name does not follow this pattern and there are multiple candidate classes that could be designated as the primary production class, then we pick the first candidate class whose method is called from the test case. Using the above information, we prepare a reverse index mapping to obtain the list of test cases corresponding to each production class. We use the mapping to retrieve the number of test smells corresponding to each production class. As explained above, we compute the testability smell density and test smell density at the class level. Finally, we compute Spearman’s correlation coefficient between testability smell density and test smell density.
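A simplified sketch of this mapping heuristic is shown below; it abstracts away DesigniteJava’s actual source-model traversal and assumes the list of called classes has already been extracted for a test case.

// Simplified sketch of the test-to-production mapping heuristic (not the tool's exact code).
import java.util.List;

class TestToProductionMapper {

    // Returns the primary production class exercised by a test class, given the
    // classes whose methods the test case calls.
    static String primaryClassUnderTest(String testClassName, List<String> calledClasses) {
        for (String candidate : calledClasses) {
            if (testClassName.equals(candidate + "Test")) {
                return candidate;                          // naming convention matched
            }
        }
        // Fallback: first class whose method is called from the test case.
        return calledClasses.isEmpty() ? null : calledClasses.get(0);
    }
}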

6.1.2 Results

Figure 5 shows the scatter plot between testability smell density and test smell density in the software systems under examination. We obtain the Spearman’s correlation coefficient \(\varvec{\rho }={\textbf {0.246}}\) (p-value \(< 2.2e-16\)); the coefficient indicates that testability and test smells share a low positive correlation.

Fig. 5: RQ1. Correlation between testability and test smell density

We extend our analysis by computing the correlation between the density of individual testability smells and the test smell density per repository. We observe that law of Demeter violation shows the highest correlation \(\rho = 0.358\) (p-value \(< 2.2e-16\)) with test smells. On the other hand, the global state exhibits the lowest correlation \(\rho = 0.076\) (p-value \(< 2.2e-16\)). Hard-wired dependency and excessive dependency show correlation \(\rho = 0.328\) (p-value \(< 2.2e-16\)) and \(\rho = 0.248\) (p-value \(< 2.2e-16\)), respectively.

We also investigate the relationship at the class-level. We identify the test cases and corresponding test smells for each production class and compute the Spearman’s correlation between the normalized values of testability smells and test smells. We obtain \(\rho =0.050\) (p-value \(= 2.903e-08\)). The results indicate that testability smells and test smells do not share any correlation at the class-level granularity.


6.2 RQ2. Do testability smells correlate with test suite size?

6.2.1 Approach

RQ2 investigates whether and to what extent the presence of testability smells leads to fewer test cases. To study this relationship, we first compute the testability smells in all the considered repositories using DesigniteJava. In addition to smells, we use the tool to figure out the total number of test cases in a repository; the tool marks each method as a test method or a normal non-test method. For simplicity, we treat each test method (i.e., a method with a @Test annotation) as a test case. Next, we compute the testability smell density as described in RQ1 and the test density of each repository. Test density is a normalized metric that represents the total number of test cases per one thousand lines of code. We compute the Spearman’s correlation coefficient between the testability smell density and the test density for each repository.

Similar to RQ1, we explore the correlation at the class level. For this analysis, as explained in RQ1, we first find the production classes that a test case is testing. With this information, we prepare a mapping between production classes and their corresponding sets of test cases. We use the mapping to retrieve the number of test cases corresponding to each production class. We then compute testability smell density and test case density at the class level and measure the correlation between testability smells and the number of test cases using Spearman’s correlation coefficient.

6.2.2 Results

Figure 6 presents a scatter plot between the testability smell density and the test density of the selected repositories. We obtain \(\rho = -0.033\) (p-value \(=0.308\)), which is not statistically significant. Therefore, testability smells and test density do not correlate with each other.

Fig. 6: RQ2. Correlation of testability smells with test density

We extend our analysis by segregating the repositories into two categories by size. In the first set, we put all the repositories that have less than 50,000 lines of code, and we put the rest of the repositories in the second set. We carry out the same analysis on both of these sets. Between testability smells and test density, we obtain \(\rho = -0.009\) (p-value \(= 0.789\)) for the first set and \(\rho = -0.050\) (p-value = 0.492) for the second set. The obtained results are not statistically significant.

Furthermore, we observe the relationship between testability smells and the number of tests at the class-level granularity. We compute the total number of testability smells for each non-test class and determine the total number of tests written for the class. In this computation, we did not include classes from projects where the number of tests for the entire project is zero, indicating that either no test cases were written for the project or the test cases were written using a framework other than JUnit. We perform this step to reduce the noise in the prepared data. We obtain \(\rho =-0.179\) (p-value \(< 2.2e-16\)) as the correlation coefficient. The results show that testability smells have only a very weak correlation with the size of the test suite.


6.3 RQ3. Do testability smells cause more bugs?

6.3.1 Approach

RQ3 aims to investigate whether and to what extent testability smells relate to, and even cause, bugs in a given software system. To answer this question, we choose five subject systems manually and perform a trend analysis by extracting information from multiple commits for each of these subject systems.

We use the following protocol to identify the subject systems for this research question. First, we obtain a list of repositories from our initially selected set (see Section 5), sorted by their number of commits in descending order, using the GitHub api. The intent here is to choose repositories with a rich commit history to facilitate a detailed trend analysis. Then, we manually check these repositories one by one to assess whether a repository uses GitHub issues and whether these issues are labelled as “bugs”. In addition, we execute DesigniteJava on the latest commit of each of these repositories to ensure that it does not take too long to run, as, otherwise, it might be prohibitive for us to analyze the entire repository containing hundreds of commits. Finally, we select the first five repositories that satisfy the above criteria: Magarena (Footnote 9), XP (Footnote 10), Rundeck (Footnote 11), MyRobotlab (Footnote 12), and Ontrack (Footnote 13).

In order to perform a trend analysis, it is crucial to select a suitable set of commits from each of these repositories. One common way is to select commits at a fixed interval, either by commit number (for example, every 100\(^{th}\) commit) or by commit date (for example, one commit per month). However, such a mechanism may result in a skewed set of commits where either significant changes are missed or commits with hardly any change are analyzed. To overcome this limitation, Sharma et al. (2020) proposed a commit selection algorithm where commits are selected based on the amount of change introduced in a commit w.r.t. the previously selected commit. In this work, we follow this strategy to select commits for each of the five identified repositories. Specifically, we first obtain all the commits in the main branch of a repository, and then we choose the first and the last commit to get started. Then, we compute five code quality metrics (i.e., weighted methods per class (wmc), number of children (NC), lack of cohesion among methods (lcom), fan-in, and fan-out) and identify changed classes based on changes in any of these metrics. If the number of changed classes between two analyzed commits differs by more than a threshold (set to 5%), we consider the commits to have significant changes between them (Sharma et al. 2020). We then pick the middle commit (i.e., the commit between the currently selected two commits) and repeat the process until we find commits with non-significant changes (Sharma et al. 2020).
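The sketch below outlines this commit-selection strategy; changedClassRatio() is a placeholder for the metric-based comparison (wmc, NC, lcom, fan-in, fan-out) described above and is not implemented here.

// Simplified sketch of the commit-selection strategy of Sharma et al. (2020).
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

class CommitSelector {
    private static final double THRESHOLD = 0.05;  // 5% of classes changed

    SortedSet<Integer> select(List<String> commits) {
        SortedSet<Integer> selected = new TreeSet<>();
        selected.add(0);                       // first commit on the main branch
        selected.add(commits.size() - 1);      // last commit on the main branch
        bisect(commits, 0, commits.size() - 1, selected);
        return selected;
    }

    private void bisect(List<String> commits, int first, int last, SortedSet<Integer> selected) {
        if (last - first < 2) return;          // no commit left in between
        // Stop when the change between the two selected commits is not significant.
        if (changedClassRatio(commits.get(first), commits.get(last)) < THRESHOLD) return;
        int middle = (first + last) / 2;
        selected.add(middle);
        bisect(commits, first, middle, selected);
        bisect(commits, middle, last, selected);
    }

    // Placeholder: fraction of classes whose wmc, NC, lcom, fan-in, or fan-out changed.
    private double changedClassRatio(String olderCommit, String newerCommit) {
        throw new UnsupportedOperationException("metric extraction omitted in this sketch");
    }
}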

Once we identify the set of commits for the trend analysis, we detect testability smells in each of the selected commits of each repository using our DesigniteJava tool. Also, we identify the total number of open and closed issues that carried the tag “bugs” at the time the commit was made. The GitHub api does not provide a direct way to determine the issues at the time of a specific commit. Therefore, we first fetch the issues opened (or closed) since the time of a commit and subtract them from the total open (or closed) issues at present. This gives us the total number of open (or closed) issues at the time of the specific commit. We record this information along with the total detected testability smells for each selected commit. Using this information, we compute Spearman’s correlation coefficient between the total detected testability smells and the total number of issues (i.e., the sum of open and closed issues for each considered commit).

We also carry out a causal analysis to figure out whether testability smells cause bugs. We use Granger’s causality (Granger 1969) analysis for this purpose. The method has been used in similar studies (Couto et al. 2013; Palomba et al. 2018; Sharma et al. 2020) to explore the causal relationship within the software engineering domain. Equation (1) presents Granger’s method mathematically.

$$\begin{aligned} a(t) = \sum _{j=1}^{k}f(s_{t-j}) + \sum _{j=1}^{k}f(b_{t-j}) \end{aligned}$$
(1)

In our context, the time series S and B represent the testability smell density, i.e., the total number of testability smells per one thousand lines of code, and the number of reported bugs, respectively, computed over a period of time. Variables \(s_t\) and \(b_t\) represent the testability smell density and the total reported bugs at time t. If the predictions of variable b using the past values of both s and b are better than the predictions using only the past values of b, then testability smells are said to cause the bugs.

In such an analysis, we must ensure the stationarity of a time series before analyzing it and drawing conclusions from it. A time series is stationary if its statistical properties, such as mean, variance, and autocorrelation, are constant over time (Cox and Miller 1965). A non-stationary time series shows seasonal effects, trends, and fluctuating statistical properties that change over time. Such effects are undesirable for the causality analysis, and thus a time series must be made stationary before we perform it. We carried out the augmented Dickey-Fuller unit root test (Fuller 1976) to check the stationarity of our time series. Initially, our time series were non-stationary. There are a few techniques to make a non-stationary time series stationary (Kwiatkowski et al. 1992). We addressed this issue by applying a difference transformation, i.e., subtracting the previous observation from the present observation for all columns. Techniques such as differencing help stabilize the mean of a time series by removing changes in its level, and therefore eliminate or reduce the non-stationary nature of the series (Kwiatkowski et al. 1992). After the transformation, we obtained stationary time series, which we confirmed by performing the augmented Dickey-Fuller unit root test again. Finally, we carry out the causality analysis using (1).
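For reference, the difference transformation replaces each observation with its change from the previous one, applied to both series:

$$\begin{aligned} s'_t = s_t - s_{t-1}, \qquad b'_t = b_t - b_{t-1} \end{aligned}$$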

6.3.2 Results

Table 4 presents the results of the experiment. The number of analyzed commits ranges between 38 (for XP) and 180 (for Rundeck). The size of the selected repositories varies from \(\approx \) 71 kloc (for Magarena) to \(\approx \) 181 kloc (for XP), as measured for the most recent analyzed commit. We compute the total number of testability smells in each selected commit as well as the total reported (open and closed) issues marked as bugs at the time of the corresponding commit for each of the selected repositories individually. The table shows the total number of testability smells detected in the most recent analyzed commit. We compute the Spearman correlation coefficient between the reported issues and the testability smell density. We observe mixed results for the correlation analysis; two repositories show a strong, one repository a moderate, and one repository a low correlation. The correlation coefficient is not statistically significant for the Ontrack repository.

Table 4 RQ3. Correlation and causation relationships between testability smell density and the number of reported bugs
Fig. 7: Box-plots of the categories of testability smell density with test density

The last column of Table 4 presents the results of the causality test. Each cell in the column shows the p-value computed for the causal relationship of testability smells with the reported bugs. The results for all the analyzed repositories show that testability smells do not cause bugs as all the obtained p-values are greater than 0.05.


7 Implications and Discussion

The results of our first research question reveal that testability smells and test smells share, at most, a low correlation (and none at the class level). This suggests that writing good-quality test code is possible even with poor testability, at least for the testability smells considered herein. The results also indicate that either the difficulty of writing tests due to the considered testability smells is orthogonal to test smells, or existing testing frameworks, e.g., mocking frameworks, make it easier to overcome the challenges posed by poor testability. Researchers may further investigate the influence of tool features, such as mocking, in facilitating testing despite poor testability.

We explore the effect of testability smells on test suite size, represented by the number of test cases, in RQ2. Figure 7 shows box-plots of the categories of testability smell density against test density. We divide the repositories into four categories, C1 to C4, based on the value of the testability smell density. For example, repositories with a testability smell density of less than five are put into category C1. We observe that the median test density for the first category is the highest among all the categories and that the test density dips for category C2. However, contrary to the common belief of developers, test density rises again in categories C3 and C4. The analysis further reaffirms that testability smells do not share a monotonic relationship with test density.

Our experiment investigating the correlation of testability smell density with the number of reported bugs does not show a consistent strong relationship. The strong correlation in two repositories and the moderate correlation in another indicate that the testability smell density increases as the software grows, since the total number of reported bugs always increases with time. Hence, a strong correlation implies that the rate of testability smells increases as the software system grows.

In the context of our study, one might wonder about the relationship of testability smells with traditional code smells. Given the definition and scope, some code smells may also be considered testability smells if they impact testability. However, such overlap is not unique to this context; for example, a violation of the 'single responsibility principle' may introduce an incohesive class (or multifaceted abstraction) smell at the design granularity and a 'feature concentration' smell at the architecture granularity. Nevertheless, we analyze testability smell density against code smell density not only at the repository level but also at the finer granularity of the class level (where we compute the total number of smells for each class of the considered repositories). We use the DesigniteJava tool to detect both code smells and testability smells and compute Spearman's correlation coefficient between the normalized total numbers of smells. At the repository level, we obtain \(\rho = 0.851\) (p-value \(< 2.2 \times 10^{-16}\)); at the class level, we obtain \(\rho = 0.857\) (p-value \(< 2.2 \times 10^{-16}\)). The strong correlation indicates that the presence of a large population of code smells is associated with the presence of a large number of testability smells and vice versa.
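The class-level variant of this analysis could be sketched as follows; the column names and the normalization by class size (LOC) are assumptions made for illustration, not necessarily the normalization used in the study.

```python
# Hedged sketch of the class-level smell-density correlation; assumes the
# detector's output has been loaded into one row per class.
import pandas as pd
from scipy.stats import spearmanr

def class_level_correlation(classes: pd.DataFrame):
    """classes: one row per class with hypothetical columns
    'code_smells', 'testability_smells', and 'loc'."""
    # Normalize raw counts by class size so large classes do not dominate.
    code_density = classes["code_smells"] / classes["loc"]
    testability_density = classes["testability_smells"] / classes["loc"]
    return spearmanr(code_density, testability_density)
```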

The elicited developers’ perspective clearly emphasizes the importance of testability smells and their potential negative impact on testing. However, the empirical evidence gathered in the study does not agree with this perspective. We observed that testability smells, at the fine-grained class-level granularity, do not correlate with test smells. The smells also do not show any influence on test density. Furthermore, the results show that testability smells do not contribute to a higher number of bugs. These results suggest that, despite developers’ invaluable experience, their opinions and perspectives might need to be complemented with empirical evidence before being put into practice.

8 Threats to Validity

This section discusses the potential threats to the validity (construct, internal, and external) of our reported results.

Construct validity Construct threats to validity are concerned with the degree to which our analyses measure what we claim to analyze. In our study, we used our DesigniteJava tool to identify the four testability smells. However, the strategies used to identify testability smells may not capture all cases of poor testability. To mitigate this threat, we thoroughly tested the tool using different cases for each smell and fine-tuned it based on the testing. We then performed a manual analysis of the four testability smells on a complete project, namely wsc; the results of this manual validation show very high recall and precision. Similarly, we implemented test smell detection by following detection strategies proposed in the existing literature.
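As a purely hypothetical and heavily simplified illustration of what such a detection strategy might look like (this is not DesigniteJava's actual implementation), consider a heuristic that flags classes whose constructors instantiate concrete collaborators directly, a dependency-handling practice that hurts testability:

```python
# Hypothetical, heavily simplified heuristic for a "hard-wired dependency"
# style testability smell in Java source; real detectors work on a parsed
# model of the code rather than regular expressions.
import re
from pathlib import Path

NEW_IN_CONSTRUCTOR = re.compile(r"new\s+[A-Z]\w*\s*\(")

def has_hard_wired_dependency(java_file: Path) -> bool:
    """Flag classes whose constructor instantiates concrete collaborators."""
    source = java_file.read_text(encoding="utf-8", errors="ignore")
    class_name = java_file.stem
    # Naive constructor extraction: first "<ClassName>(...) { ... }" block.
    ctor = re.search(re.escape(class_name) + r"\s*\([^)]*\)\s*\{(.*?)\}",
                     source, re.S)
    return bool(ctor and NEW_IN_CONSTRUCTOR.search(ctor.group(1)))
```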

Internal validity Internal threats to validity are concerned with the ability to draw conclusions from our experimental results. We carried out an anonymous online survey targeting developers by posting our survey on professional social media channels (Twitter, LinkedIn, and Reddit). Given the anonymity of the survey, we do not have any mechanism to verify the level of experience claimed by the participants. However, based on the quality of the responses provided by the participants, we believe that this threat is mild. In addition, the software developers who participated in our online survey were not selected based on the repositories we analyzed. As a result, the opinions of developers could be influenced by the repositories they usually contribute to and might not agree with our empirical results. To mitigate this, we did not target developers from specific repositories but rather expanded our participation range by posting invitations on professional social media channels.

Agreement bias (or acquiescence bias) refers to participants’ tendency to agree with a statement rather than disagree with it (Toner 1987). To mitigate this threat, we designed our questionnaire in a neutral tone and provided Likert-scale response options. A similar threat to validity is participants’ acquaintance with the authors. To avoid this threat, we did not circulate the survey in our internal organization groups; we also restricted the sharing to professional social media channels and hence did not share the survey on, for example, our Facebook profiles or groups.

RQ3 investigates causality between testability smells and the number of reported bugs; the analysis reveals that the testability smells do not cause bugs. There are two possible threats to this conclusion. First, it is possible that testability smells other than those considered in the study have a larger impact on bugs. However, even though there could be many other testability smells, the considered smells are representative, as shown by our developers’ survey. Second, the study only considered reported (i.e., known) bugs; it is possible that many more unknown bugs exist and may influence the results and conclusion of the experiment.

External validity External threats are concerned with the ability to generalize our results. The \(1,115\) GitHub repositories analyzed in this paper were selected using well-defined criteria from the RepoReapers dataset. However, some repositories might have switched from public to private or might no longer exist on GitHub, which might affect the applicability of the selection criteria used in this paper. In addition, all the selected repositories contain software written in Java, which might affect the generalizability of our findings. The main reason for focusing on Java is that the majority of research on software quality analysis has been done on Java code, and hence we can leverage existing tools to achieve the goals of our study. Along the same lines, the implemented test smell detection works only if the tests are written using JUnit; the rationale behind this decision is that JUnit is the most commonly used testing framework for Java. We encourage future research to expand the analyses conducted in this paper to software written in other programming languages.

9 Conclusions

This study explores practitioners’ perspectives about testability smells as well as experimental evidence gathered via a large-scale empirical study on \(1,115\) Java repositories containing approximately 46 million lines of code in order to better understand the relationship of testability smells with test quality, number of tests, and reported bugs.

Specifically, the study surveyed software developers to elicit their opinions and perspectives about testability smells. The survey showed that software developers consider poor testability a factor that impedes software testing; it also revealed their strong agreement with the proposed testability smells. Then, we conducted an extensive empirical evaluation to observe the relationship between testability smells and test-related aspects such as test smells and test suite size. Our results show that testability smells correlate neither with test smells at the class granularity nor with test suite size. Furthermore, we did not find evidence that testability smells cause bugs.

Our study has implications for both the research and industrial communities. Software developers often have strong opinions about software engineering concepts; however, experimental evidence may not support them in general. Specifically, this study shows that developers’ opinions about testability do not concur with the experimental evidence. Hence, opinions and perspectives must be complemented with empirical evidence before being put into practice. This also highlights the importance of data-driven software engineering, which advocates the need and value of adopting design and development decisions supported by data. Researchers can use our tool to detect testability smells in order to further evaluate and confirm our observations. Researchers may also propose additional testability smells and investigate their collective impact on other relevant testing aspects, such as testing effort.