1 Introduction

There is an extensive literature on systematic software reuse (or reuse for short): its purpose and promises, how to develop for and with reuse, technical, managerial and organizational aspects, measuring reuse rate and Return-On-Investment (ROI), and successes and failures of reuse practices. However, little research has accumulated the reported results in a review such as this one, whose goal is to collect and appraise evidence and to identify gaps for future research.

In this paper, we summarize empirical quantitative evidence from industry on reuse that appeared between 1994 and 2005. The question guiding the review is: “To what extent do we have evidence that software reuse leads to significant quality, productivity or economic benefits in industry?” Specifically, the aims of this review are to (1) find and organize the quantitative empirical evidence from industry related to the review question, (2) evaluate the quality of reporting and identify metrics, data collection procedures and analysis methods, (3) summarize the findings, and (4) identify gaps for future research. The review defines five research questions that are answered through the evidence. The intended audience of this review consists of two groups: those who plan future studies on reuse may learn from experience to improve the state of research, and those who seek evidence on reuse benefits for decision-making may use this review as a reference.

The remainder of the paper is structured as follows. Section 2 provides definitions of the concepts used in the review. Section 3 presents the review process in terms of research questions, the framework for performing the review, the paper selection criteria and validity threats. Section 4 gives an overview of the reviewed papers, and Sections 5–7 assess the papers regarding the metrics used, data collection and analysis, and summarize the findings. Section 8 discusses shortcomings in reuse research and ideas for future research, and Section 9 presents lessons for improving research. The review is concluded in Section 10.

2 Concepts

There is a diversity of definitions in the literature on reuse and types of studies; our purpose here is not to review these definitions, but to state what we mean when using the terms.

2.1 Software Reuse

Software reuse is the systematic use of existing software assets to construct new or modified assets or products. Software assets in this view may be source code or executables, design templates, free-standing Commercial-Off-The-Shelf (COTS) or Open Source Software (OSS) components, or entire software architectures and their components forming a product line or product family. Knowledge may also be reused, and knowledge reuse is partly reflected in the reuse of architectures, templates or processes. Developing components so that they become reusable is called developing for reuse, while developing systems of reusable components is called developing with reuse (Karlsson 1995). Both are covered in the review. Reusability is a property of a software asset that indicates its probability of reuse (Frakes and Kang 2005). Ad-hoc reuse in this review means that reuse is opportunistic and not part of a repeatable process, as opposed to systematic (i.e., planned) reuse. Glass (2002) notes that “ad hoc” originally means “for this” or “suited for the task at hand,” which is different from not planned or not repeatable. However, the term is widely used in other literature as the opposite of systematic reuse, and this review uses it in that sense as well.

Almost all software today is built on software developed by others, for example operating systems, programming-language libraries, CASE tools, debuggers, desktop applications, databases or application servers. Reuse in this review does not cover such software, which is not developed by the company itself but is purchased or obtained as OSS products for use in software development.

Sometimes reuse refers to developing new releases of assets or products based on previous releases (Basili 1990). We call this release-based or incremental development, which is in fact a maintenance and evolution activity. Frakes and Terry (1996) call this “carry-over reuse.” This type of reuse is not included in this review.

2.2 Study Types

We decided to include all studies reporting quantitative results from industry related to reuse and then classify the study type, leaving out surveys and papers with discussion but no hard data. The study type is important information in each study since it communicates what is expected from the study and how the evidence should be evaluated. However, a search of the literature showed that definitions of study types are inconsistent or not communicated well. We therefore have to define our own perspective on study types.

One classification of study types applied to empirical research is given by Zannier et al. (2006) (see that paper for a complete list of their references). Table 1 shows these definitions together with some others that we found.

Table 1 Study types and their definitions

Zannier et al. (2006) analyzed a random sample of 63 papers published in 29 ICSE (the International Conference on Software Engineering) proceedings since 1975 using the above classification. The authors of only 25 papers had defined their study type, and Zannier et al. give both the authors’ and their own perspective on the study types. We use their definitions, but also add that studies performed at a single point in time are called cross-sectional, as opposed to longitudinal studies.

A case study may be comparative, and Kitchenham and Pickard (1998) describe three methods of comparison in a quantitative case study: (a) comparing the results of using a new method with a company baseline; (b) within-project component comparison, where components within a project that are randomly exposed to a new method are compared with the others; and (c) sister-project case studies, where a project using a new method is compared with a sister project that uses the current method. An alternative to the sister-project design is developing a product twice using different methods, the replicated product design. This review found examples of (b) and (c) in different types of studies, and we therefore refer to the methods of comparison as component comparison (where components may be from one or several products) and sister-project comparison (including the replicated product design).

2.3 Objects and Subjects of Study, Variables and Measurement

We use the definitions of variables, treatments, objects and subjects of study of Wohlin et al. (2000). The object of study is the entity that is studied, for example a program that shall be developed with different techniques. The people who apply the treatment are called subjects, for example the developers of a software product. The characteristics of both the objects and the subjects can be independent variables in a study. All variables that are manipulated and controlled are called independent variables. The variables in which we want to observe the effect of changes in the independent variables are called dependent variables. A treatment is one particular value of an independent variable. Treatments are applied to combinations of objects and subjects. A confounding factor is a factor that makes it impossible to distinguish the effects of two treatments from each other, such as different skills of developers.

Measurement is used here for the activity of measuring a property of software; a metric is the property of software that is measured, for example software size in Lines of Code (LOC); and a measure is the symbol or number assigned to the property by the activity of measurement.

For case studies, Yin (2003) recommends defining the unit of analysis, or what the “case” is (for example individuals, a product or an organization), and the sources of evidence, which may be documentation, archival records, interviews, direct observations, participant observation, and physical artifacts. Using multiple sources of evidence is strongly recommended to increase the reliability of a case study. Since this terminology is not used in the papers, we have not summarized the papers with regard to their sources of evidence.

3 The Review Process

This section presents the review framework and the five research questions that are derived from the review question, the paper inclusion criteria and the validity threats of the review.

3.1 Review Framework and Research Questions

In this review, we ask “To what extent do we have evidence that software reuse leads to significant quality, productivity or economic benefits in industry?” This research was initiated in connection with a study we performed on the quality benefits of reuse. In spite of our expectation of overwhelming evidence, the search for papers showed that reported results from industry are surprisingly few. In addition to the sparseness of the results, the question of practical significance is rarely discussed. We also searched for experiments in artificial settings, which added only one student experiment (Basili et al. 1996) to the search results; it is not included in this review.

The formulation of the review question follows the recommendations by Dybå et al. (2005) for collecting evidence to answer questions: questions should be well partitioned into intervention, context and effect. In this review, the intervention is “software reuse,” the context is “industrial settings” and the effect is “changes in quality, productivity or Return-On-Investment (ROI).” The intervention is either directly or indirectly measured in reuse metrics, while the effect is measured in dependent variables such as problem density. Figure 1, inspired by Wohlin et al. (2000) and Dedrick et al. (2003), shows the framework guiding this review.

Fig. 1 The review framework

We have also added the appraising view to the question by asking the significance of the results. Specifically we ask the following research questions:

  1. RQ1: What types of studies are performed and what data are reported on the reuse approaches?

  2. RQ2: Which metrics are used for reuse and its effects?

  3. RQ3: How are quantitative data reported and analyzed?

  4. RQ4: What are the findings and what theory may be developed based on the findings?

  5. RQ5: What are the shortcomings regarding reuse research?

RQ1 to RQ5 are discussed in Sections 4–8, respectively. Since we have not found any earlier review of this kind, we need to perform a detailed exploratory analysis of the papers. Our guides in performing this review have especially been Webster and Watson (2002), Kitchenham (2004), Kitchenham et al. (2002), Gregor (2002) and Pickard et al. (1998).

3.2 Paper Inclusion Criteria

The review concentrates on studies whose results are published in peer-reviewed journals and conferences. Additional sources would be books and technical reports (for example, Hallsteinsen and Paci 1997), which are not included in the review.

We searched the ACM digital library and IEEE Xplore, which also include many conference proceedings, the Empirical Software Engineering Journal, the Journal of Systems and Software, the Journal of Information Science, MIS Quarterly (MISQ) from September 1994 (online), IEEE Transactions on Software Engineering (TSE), IT Professional, ACM Computing Surveys (CSUR), and the Journal of Research and Practice in Information Technology (online from 2003). We searched the above sources with the keywords “reuse,” “reuse benefits” and “reuse case study.” To ensure better coverage, the proceedings of the International Conference on Software Reuse (ICSR), the IEEE International Conference on Software Maintenance (ICSM, online since 1995), the International Software Product Line Conference (SPLC, started in 2001), the International Conference on Software Engineering (ICSE, online since 1995), the International Conference on COTS-Based Software Systems (ICCBSS, started in 2002), MISQ and the IEEE Software magazine were checked manually. We searched for papers reporting quantitative results but discuss their qualitative findings as well. We searched for case studies and experiments but excluded surveys.

We only reviewed papers published from 1994 to 2005. Frakes and Terry (1996) provide a survey of reuse metrics and models from earlier research, and Hallsteinsen and Paci (1997) summarize some earlier research. This review differs from the above sources in giving an explicit selection criterion and listing the searched resources, in appraising evidence and discussing significance, and in classifying studies and covering new research.

The review process identified eleven papers that match our selection criterion, all retrieved in full text, which compared systematic reuse with ad-hoc or no reuse, or compared reused components with non-reused ones. In addition to these papers, Ramachandran and Fleischer (1996) included data on reuse rate but no quantitative findings on benefits, and it is therefore not included in the review. We also found three papers on reuse of OSS components that are not included in the review, either because they describe reuse of software for infrastructure, because they lack quantitative data, or both:

  • Madanmohan and Dé (2004) performed structured interviews with developers in some commercial firms to find out how they use OSS software. They classified the products as operating systems, middleware, databases and support software. The paper has no data on ROI or quality.

  • Norris (2004) writes that using OSS software for developing mission-critical software at NASA has reduced in-house effort and provided software with fewer bugs, without giving quantitative data.

  • Fitzgerald and Kenny (2004) report on cost savings from using OSS software when developing an infrastructure system for a hospital. Phase 1 of the project covered generic products such as an email system, a content management system and desktop applications, and showed significant savings. Phase 2 would cover more specific products, but at the time of publication it was still being planned and the savings were only estimated in the paper.

The final list therefore includes the following eleven papers, ordered here by year of publication: Lim (1994), Thomas et al. (1997), Frakes and Succi (2001), Succi et al. (2001), Morisio et al. (2002), Tomer et al. (2004), Mohagheghi et al. (2004), Baldassarre et al. (2005), Morad and Kuflik (2005), Selby (2005), and Zhang and Jarzabek (2005).

3.3 Threats to the Validity of the Review

The main threats to validity of the review are:

  • Uncovered publication channels (external validity): We chose the journals, conferences and libraries that in our experience publish the major research results on software reuse. An additional search might add new papers, but would have required more effort. Stating the inclusion criterion and the publication channels allows validation and extension of the review.

  • Undetected papers (external validity): We searched with a few keywords, but to improve the detection process we manually checked several publication channels. Of the reviewed papers, only one does not include the word “reuse” in the title and was detected by the manual check, namely Morisio et al. (2002); this indicates that we may have missed some papers, but that the extent is limited.

  • Publication bias (internal validity): Success cases of reuse are probably published more often than failures, and significant results may be published more often than results that are not considered significant.

  • Researcher bias (construct validity): Both authors have experience with industrial software reuse. We compared papers to determine relevant classifications and searched the literature for definitions. The classifications and conclusions reflect our knowledge and opinion. We have done our best to provide an objective review when analyzing the research and have presented all the results in the review to allow discussion and future extension. The main analysis was performed by the first author and the results were discussed with the second author.

4 Answering RQ1—What Types of Studies are Performed and What Data are Reported on the Reuse Approaches?

In this section, we review the studies regarding the object of study, study type, domain, scale, publication channel, the year of publication, and the data reported on the reuse approaches.

4.1 Objects, Types of Studies, Scale, Publication Channel and Year

Appendix A gives an overview of the eleven reviewed papers, ordered by year of publication. It also shows the objects and types of studies. We applied the classification discussed in Section 2.2, while most papers also define their study type. The field “Agreement” shows whether there is agreement between the review’s and the authors’ perspectives on the study type. KLOC stands for Kilo Lines of source Code, a measure of software size. When a paper does not provide information on an attribute, the label “−” is used. The conclusions may be summarized as:

  • Study type: Thomas et al. (1997) and Selby (2005) do not discuss the study type. For the others, we have shown the study type from both the authors’ and the review’s perspective. The main differences in classification are: (1) Quasi-experiments and experiments from the authors’ perspective (Frakes and Succi 2001; Succi et al. 2001; Zhang and Jarzabek 2005) are classified as (exploratory) case studies and experience reports in this review because of the lack of clear hypotheses and the low degree of control applied by the investigators. (2) Two case studies from the authors’ perspective (Lim 1994; Tomer et al. 2004) are classified as an experience report and an example application in this review. The term “case study” is often used in the literature to cover all studies where some data on “cases” are presented. The review has identified four case studies (Thomas et al. 1997; Mohagheghi et al. 2004; Baldassarre et al. 2005; Selby 2005), three exploratory case studies (Frakes and Succi 2001; Succi et al. 2001; Morisio et al. 2002), three experience reports (Lim 1994; Morad and Kuflik 2005; Zhang and Jarzabek 2005), and one example application (Tomer et al. 2004).

  • Publication channel: Four papers are published in various conference proceedings and seven in journals, where IEEE Trans. Soft. Eng. has published four of the papers.

  • Year: 2005 has been the most productive year with four papers.

Succi et al. (2001) and Tomer et al. (2004) do not report the programming language, and four of the papers (Tomer et al. 2004; Baldassarre et al. 2005; Morad and Kuflik 2005; Zhang and Jarzabek 2005) do not report the size of the products or the reusable assets, while in Zhang and Jarzabek (2005) it is not clear whether 4.5 KLOC is the total size or the mean size of the applications. There is variation in domain and programming languages. Based on the given software size and our conclusions, we classified the studies according to their scale into:

  • Small-scale studies (S): Five studies (Frakes and Succi 2001; Morisio et al. 2002; Tomer et al. 2004; Morad and Kuflik 2005; Zhang and Jarzabek 2005) cover a few reused software assets or small products.

  • Medium-scale studies (M): Three studies (Lim 1994; Succi et al. 2001; Baldassarre et al. 2005) cover larger products than the first group, but less than or around 100 KLOC, and the objects of study are still few.

  • Large-scale studies (L): Three studies (Thomas et al. 1997; Mohagheghi et al. 2004; Selby 2005) cover products with software size more than 100 KLOC or cover a large number of objects.

4.2 Reuse Approaches

Appendix B presents data on the reuse approaches. The definitions of terms and an overview for each field are given in Table 2.

Table 2 Summary of reuse approaches

Morisio et al. (2000) have a similar list for comparing projects with a few additional factors such as whether there exists an explicit reuse process or when the reusable assets are developed (on demand or beforehand). Most of the papers in this review did not include data on these factors.

4.3 Summary of the Section and Answering RQ1

All 11 studies in this review are classified as observational studies, with four case studies, three exploratory case studies, three experience reports and one example application. We conclude that our search has not returned any experiment or quasi-experiment in industry. Four studies are sister-project studies, comparing projects or products, and in one case using a replicated product design. The closest to experimentation is comparing projects of similar size and domain, developed within the same company where developers have comparable skills or may be randomly assigned (Succi et al. 2001; Baldassarre et al. 2005; Morisio et al. 2002), or redeveloping a product with systematic reuse (Zhang and Jarzabek 2005). This review of the literature has not found any study with random assignment of treatments to objects, and only Baldassarre et al. (2005) report random assignment of developers to the two projects. Zannier et al. (2006) also report that they did not find any example of simple or stratified random sampling in a sample of ICSE papers. The selection of objects (products or components) is driven by access to data, which may be classified as investigator-selected or convenience sampling. On the other hand, Zannier et al. found the absolute majority of studies to be self-confirmatory, i.e., the authors played a role in the development of the product under study. We did not find support for that in this review; only in three cases (Lim 1994; Mohagheghi et al. 2004; Morad and Kuflik 2005) were the authors employees of the companies. However, for several studies the relation of the investigators to the case was not clearly stated.

The scale of the studies varies (small-scale studies are most represented), and so do the approaches to reuse, the domains and the reuse rates. Most studies involved systematic reuse or compared it with ad-hoc reuse. Units of reuse varied as well, but only reuse of source code was measured. Only in three of the small-scale studies were reused components developed externally or before the application, and only Morad and Kuflik (2005) give an example of reuse of three OSS assets, with savings estimated in person-hours. A few studies do not report the size of the products or components, the programming languages or the characteristics of their reuse approaches.

Only two studies report data from several releases of a software product or projects over time; i.e., Mohagheghi et al. (2004) and Lim (1994), where the first study evaluates components in several releases of one software product and the second study reports productivity gains over several years. Selby (2005) has collected data for several years of development, but does not present the data as releases of the same products, only from the same environment. Most studies in this review are therefore cross-sectional, and long-term effects of reuse are understudied.

5 Answering RQ2—Which Metrics are Used for Reuse and its Effects?

This section presents the metrics used in the studies, except for the metrics related to cost-benefit analysis, which are few and are presented in Section 7 together with the findings.

5.1 Independent Variables—Reuse Metrics

While attributes of the reuse approaches presented in Section 4.2, such as development scope, or characteristics such as domain may be used as independent variables, most of them are fixed in the studies. The independent variables related to reuse are (see Table 3):

  • Development mode is a two-level factor and refers to whether development happens with or without systematic reuse in a project. It is used in sister-project studies.

  • Component origin is a multi-level factor and refers to whether a specific component (or any asset) is reused verbatim, slightly or extensively modified, or is newly developed. It is used in component-comparison studies.

  • Reuse rate quantifies the amount of reuse in a project or sometimes within a component (a minimal computation sketch follows Table 3). Reuse rate may be used as a dependent variable as well, but not in the reviewed literature.

Table 3 Independent variables and their definitions in the papers
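To make the reuse-rate metric concrete, the sketch below computes a simple size-based reuse rate (reused LOC divided by total LOC) and records each component's origin. This is a minimal illustration under our own assumptions; the field and function names and the choice of which origin categories count as reused are ours and are not taken from any of the reviewed papers, which define reuse rate in different ways.

```python
from dataclasses import dataclass

@dataclass
class Component:
    name: str
    loc: int      # size in lines of code (LOC)
    origin: str   # "verbatim", "slightly modified", "extensively modified" or "new"

def reuse_rate(components, reused_origins=("verbatim", "slightly modified")):
    """Size-based reuse rate: LOC of reused components divided by total LOC."""
    total_loc = sum(c.loc for c in components)
    reused_loc = sum(c.loc for c in components if c.origin in reused_origins)
    return reused_loc / total_loc if total_loc else 0.0

# Hypothetical product with two reused components and one new component.
product = [
    Component("parser", 1200, "verbatim"),
    Component("gui", 800, "slightly modified"),
    Component("billing", 2000, "new"),
]
print(f"Reuse rate: {reuse_rate(product):.2f}")  # 0.50
```

Whether extensively modified components count as reused is exactly the kind of definitional choice that varies between the reviewed studies, which is why the category set is an explicit parameter here.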

5.2 Dependent Variables—Reuse Effects

We analyzed all the dependent variables used in the papers and identified four major groups: metrics related to software problems (Table 4), effort and productivity (Table 5), software change (Table 6) and module level metrics (Table 7).

Table 4 Metrics related to software problems
Table 5 Effort and productivity metrics
Table 6 Metrics related to software change
Table 7 Module-level metrics

We use the term “problem” in this review to cover errors, defects and faults when the distinction is not clear or when we refer to all of them. There is inconsistency and vagueness in the use of these terms. Mohagheghi et al. (2006) identified three questions to answer: what is covered (the appearance of a problem or its cause), where problems are (software vs. system, executable vs. non-executable software such as requirements or documentation) and when problems are detected (the detection phase); a hypothetical record layout illustrating these dimensions is sketched after the list below. These differences are visible in Table 4:

  • “Errors” are often counted as appearances of a problem. A single error may lead to changes in several modules or have several causes, which are called “defects” or “faults.”

  • Problems may be reported only for source code or all types of artefacts (executable and non-executable such as documents).

  • Problems may be recorded pre-release, post-release or in both phases.
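As a hypothetical illustration of these three dimensions, a problem report could record explicitly what is counted, where the problem lies and when it was detected. The record below is our own sketch, not a reporting schema used in any of the reviewed studies.

```python
from dataclasses import dataclass

@dataclass
class ProblemReport:
    covers: str    # what is counted: "error" (appearance) or "defect"/"fault" (cause)
    artefact: str  # where the problem is: e.g. "source code", "requirements", "documentation"
    phase: str     # when it was detected: "pre-release" or "post-release"

# Example: a fault found in source code after release.
report = ProblemReport(covers="fault", artefact="source code", phase="post-release")
print(report)
```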

All of the papers in Table 4 have used metrics related to problems as quality indicators. Since the discussion of the relation between the dependent variables and quality (quality-in-use or other views such as process quality) is often missing in the papers, we do not get into this discussion and refer to it as a threat to construct validity in Section 7.5. Lim (1994), Thomas et al. (1997), Frakes and Succi (2001), Succi et al. (2001) and Mohagheghi et al. (2004) have used counts of problems or their density, while Thomas et al. (1997), Morisio et al. (2002) and Selby (2005) have included rework effort or isolation and correction difficulty as quality indicators.

Table 5 shows metrics related to effort and productivity. Increased productivity and decreased development time or effort are often given as the main motivations for reuse. Apparent productivity is calculated by dividing the total size of the software by the total effort spent, while actual productivity is calculated by dividing the size of newly developed code by the total effort. One inherent problem with this approach is that integrating or modifying reusable assets takes effort that is included in the total effort, while the size of these assets is not counted as new code. With reuse, apparent productivity therefore obviously increases. We discuss an alternative approach to measuring actual productivity in Section 8.2.
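In symbols, writing S_total for the total delivered software size, S_new for the size of newly developed code and E_total for the total effort (our notation, not that of the reviewed papers), the two notions can be written as:

\[ P_{\mathrm{apparent}} = \frac{S_{\mathrm{total}}}{E_{\mathrm{total}}}, \qquad P_{\mathrm{actual}} = \frac{S_{\mathrm{new}}}{E_{\mathrm{total}}} \]

Since the effort to integrate or modify reused assets is counted in E_total while their size appears only in S_total, apparent productivity rises mechanically with the reuse rate, whereas actual productivity may even decrease.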

Reducing the number of changes or the size of modified code should improve maintainability of a product, which motivates the use of software change metrics in Table 6.

Module-level metrics are used in two ways as shown in Table 7: Sister-project studies have evaluated whether development with reuse reduces product complexity, while two component-comparison studies (Thomas et al. 1997; Selby 2005) have used these metrics to characterize reuse at module level.

5.3 Summary of the Section and Answering RQ2

We identified three independent variables, which are Development mode, Component origin and Reuse rate, while other attributes of reuse presented in Section 4.2 may also be used as independent variables.

Metrics used to measure reuse effects are divided into four groups: metrics related to software problems (used in seven papers), effort or productivity (eight papers), software changes (four papers) and software module characteristics (five papers). In addition to these, Zhang and Jarzabek (2005) have measured improvement in performance in terms of memory usage and running-time speed. The papers have used 22 different dependent metrics, with very few examples of a common definition. In many cases, the metrics are not well defined either, especially those related to software problems. The diversity of metrics and definitions makes comparison of quantitative results difficult. The tables in this section may help future studies in choosing metrics so that several studies use common or comparable metrics. This is one precondition for combining evidence systematically or performing any meta-analysis in the field.

6 Answering RQ3—How are Quantitative Data Reported and Analyzed?

Appendix C shows how data are reported and analyzed in the eleven papers. We summarize the observations here.

Small-scale studies have used all the available data in the analysis and have mostly included the dataset in the papers. Except for Morisio et al. (2002) with hypotheses and a regression model, the other four small-scale studies have not defined hypotheses or applied inferential statistics. Medium and large-scale studies do not present all the data. However, data are fully analyzed and no sampling is done in these studies. In the three large-scale studies, the researchers have mined industrial databases and in two of them, data were inserted in relational databases for analysis. Appendix C also shows that the range of statistical tests is limited and there are no examples of data transformation (for example logarithmic transformations) in the studies.

Six papers have applied statistical tests and the authors have discussed preconditions such as normal distribution of the data. However, when it comes to defining hypotheses and applying inferential statistics, we observe variants of the null ritual in four papers. Gigerenzer (2004) defines the null ritual in three steps:

  1. Set up a statistical hypothesis of no difference or no correlation. Do not specify any alternative hypothesis.

  2. Use 0.05 (or some other fixed value) as a convention for rejecting the null. Report the results as p < 0.05, p < 0.01 or p < 0.001, whichever comes next to the obtained value.

  3. Always perform this procedure.

The null ritual is a modification of Fisher’s null hypothesis testing, which may be summarized in the following three steps:

  1. Set up a statistical null hypothesis. The null need not be a nil hypothesis of no difference.

  2. Report the exact level of significance and do not talk about accepting or rejecting hypotheses.

  3. Use this procedure only if you know very little about the problem at hand. This procedure does not allow combining previous knowledge in the inference, e.g., in contrast to the Bayesian approach.

Gigerenzer (2004) writes that statistical rituals eliminate statistical thinking and that inferential statistics should be applied with care. Alternatives are Exploratory Data Analysis (EDA) techniques, or reporting descriptive statistics and drawing conclusions without performing hypothesis testing. Even with well-defined null and alternative hypotheses, the selection of a 0.10 or 0.05 level of significance is a matter of personal choice, depending on whether a researcher is more averse to missing a significant effect or to reporting a spurious one. The papers often do not discuss why a certain level of significance is selected, or even why it varies within a single study.
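As a concrete contrast to the null ritual, the fragment below reports the exact p-value of a two-sample comparison instead of only a fixed threshold such as "p < 0.05". The data are invented for illustration, and the use of a Mann–Whitney test is our own choice, not a procedure prescribed by any of the reviewed papers.

```python
from scipy import stats

# Hypothetical problem densities (problems/KLOC) for reused vs. new components.
reused = [0.8, 1.1, 0.9, 1.3, 0.7, 1.0]
new = [1.9, 2.4, 1.6, 2.1, 2.8, 1.7]

# Non-parametric two-sample test; report the p-value itself rather than
# only whether it falls below a conventional threshold.
u_stat, p_value = stats.mannwhitneyu(reused, new, alternative="two-sided")
print(f"U = {u_stat}, p-value = {p_value:.4f}")

# Practical significance should be discussed separately, e.g. via the
# difference in medians or the effort saved, not inferred from the p-value alone.
```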

Some papers in the review report the p-values, while others do not or only report values above a certain threshold. Lim (1994) has discussed the results as significant for the company without applying inferential statistics, Succi et al. (2001) have reported p-values and considered the results significant, and Mohagheghi et al. (2004) have discussed practical significance for the company in terms of saved effort. The other papers have used fixed thresholds for discussing significance without reflecting on practical significance.

7 Answering RQ4—What are the Findings and What Theory may be Developed Based on the Findings?

This section summarizes the findings in terms of reuse economics, quality and productivity benefits, qualitative findings and validity concerns.

7.1 Reuse Economics and Savings

An overview of the metrics and findings related to cost-benefit models is given in Table 8. The cost of reuse is assessed as the cost of developing reusable assets and of integrating them. No costs are evaluated for training, reuse infrastructure or setting up reuse repositories. Savings are assessed in development and rework effort (a generic cost-benefit form is sketched after Table 8).

Table 8 Metrics and findings on reuse economics
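As a point of reference, a generic cost-benefit form consistent with the costs and savings listed above can be written as follows; this is our own illustrative formulation, not a model taken from any of the reviewed papers:

\[ \mathrm{ROI}_{\mathrm{reuse}} = \frac{B - (C_{\mathrm{for}} + C_{\mathrm{with}})}{C_{\mathrm{for}} + C_{\mathrm{with}}} \]

where B denotes the benefits (savings in development and rework effort), C_for the cost of developing assets for reuse, and C_with the cost of integrating or adapting them when developing with reuse.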

7.2 Findings Related to Quality and Productivity

The detailed findings related to quality and productivity in the four sister-project studies are shown in Appendix D. Kitchenham (2004) lists a set of criteria for quality assessment of studies. None of the studies claim to have selected their cases randomly from a population or present them as representative of a population. However, sister projects are claimed to be comparable with respect to domain, size, duration and developer skills (one developer in Morisio et al. 2002).

Five studies have compared reused components (reused verbatim or modified) with new code, sometimes within the same product and sometimes within a collection of products. We call these component-comparison studies. The studies of Tomer et al. (2004) and Morad and Kuflik (2005) are not included here since they only include data on effort savings. All data are analyzed in the studies. Appendix E summarizes the quantitative results of these studies.

In the sister-project studies, some control is applied by the investigators in the design of the studies, which ranks them higher in the chain of evidence. Component-comparison studies analyze available data on components with no control over the study. On the other hand, the sister-project studies are all of small or medium scale, while three component-comparison studies have mined large industrial databases. One observation of this review is that we have not found results in favour of no reuse or no systematic reuse, except for error correction difficulty (Thomas et al. 1997) and fault severity (Mohagheghi et al. 2004).

Two large-scale case studies, i.e., Thomas et al. (1997) and Selby (2005), have compared the characteristics of reused modules with non-reused ones. Both studies reported that modules reused verbatim were significantly smaller in size. Selby (2005) found that modules reused verbatim tended to be small, well-documented modules with little input–output processing. It also seems that these modules tended to be terminal nodes, because they had less interaction with other system modules but more interaction with utility functions. Thomas et al. (1997) reported that components reused verbatim from a domain library were smaller in size and had fewer external dependencies. There is, however, one difference: modules reused verbatim in Selby (2005) had simpler interfaces than other modules in terms of input–output parameters per LOC, while components reused verbatim in Thomas et al. (1997) had more parameters than either modified or new components. Thomas et al. (1997) attribute the difference to the different approaches to reuse in Ada and FORTRAN.

7.3 Combining the Results for Quality and Productivity

We have a range of quantitative results that we want to appraise and combine. Pickard et al. (1998) describe three methods for combining the results of empirical studies:

  • Combining the p-values of studies which can reject a null hypothesis or fail to reject it, without giving any information on the actual effect.

  • Meta-analysis when the studies have used comparable metrics and reported a quantitative measure of effect size.

  • Vote-counting, which does not depend on the actual effect size values or comparable metrics. The different outcomes of the hypothesis tests are categorized into significant positive effect, significant negative effect or non-significant effect. Each study then casts a “vote” in support of one of these relationships and the numbers of votes are counted, thus becoming a new scale that behaves like p-values. If the ratio of votes to the total number of studies is over a predetermined cut-off value, a relationship for the specific variable is identified. The method assumes that there is one underlying common phenomenon, for example when a single correlation coefficient is applied.

Each of the above methods has its requirements. The first one depends on p-values, which are not reported in several studies. Meta-analysis requires homogeneous studies and comparable metrics, while the studies in this review vary in type and metrics. Vote-counting requires an underlying common phenomenon, but it allows testing very weak hypotheses. However, it may be the only applicable method when there are different metrics for a phenomenon or the reported information is very limited.

We therefore decided to perform a modified form of vote-counting by categorizing the findings as “significant positive,” “significant negative,” “positive,” “negative,” and “no relation.” This way, we can evaluate the weight of evidence. By the weight of evidence, we mean the extent to which empirical results are consistent across a variety of studies (Pickard et al. 1998). We add the scale of the studies to evaluate whether reuse scales up, and significance in our vote-counting covers both practical and statistical significance, depending on which one is discussed in the papers. Vote-counting is also discussed in Mohagheghi and Conradi (2006).
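A minimal sketch of such a vote-counting tally could look as follows; the category labels mirror those used in Table 9, while the counting function and the example votes are our own illustration rather than the actual tallying procedure of the review.

```python
from collections import Counter

# Votes cast by hypothetical studies for one dependent metric,
# using the categories of Table 9.
votes = ["significant positive", "significant positive",
         "positive", "no relation", "positive"]

def summarize(votes):
    """Count votes per category and the share of significant positive results."""
    counts = Counter(votes)
    n = len(votes)
    share_sig_pos = counts["significant positive"] / n if n else 0.0
    return counts, share_sig_pos

counts, share = summarize(votes)
print(counts)                                        # Counter({'significant positive': 2, 'positive': 2, ...})
print(f"Significant positive in {share:.0%} of the studies")
```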

Table 9 shows a summary of the findings. The dependent metrics are ordered by their popularity as given in the column “Metric included” (from 11 studies). In Table 9, “+ +” means a significant positive effect of reuse, “+” means a positive effect, “0” means no relation or inconsistent results, “−” means a negative effect and “− −” means a significant negative effect of reuse. Note that Frakes and Succi (2001) do not discuss significance due to small sample size, Lim (1994) discusses practical significance, three studies (Tomer et al. 2004; Morad and Kuflik 2005; Zhang and Jarzabek 2005) are experience reports or example applications without discussion of significance, while the remaining studies discuss statistical significance (and in some cases practical significance as well). The three last columns in Table 9 show summary statistics indicating the number of studies that include a metric and how often the results were significantly positive or negative.

Table 9 Vote-counting for reuse effects on software quality and productivity

Three dependent metrics are not included in Table 9 because the results were difficult to interpret (sources of error, decrease in time-to-market due to reduced development effort, and design effort). More than half of the dependent metrics are only used in single studies. The relations in Table 9 can be summarized for either the independent or the dependent metrics. When summarized for the independent metrics:

  • Development mode: When comparing development with systematic reuse to development without it across projects in four studies (Succi et al. 2001; Morisio et al. 2002; Baldassarre et al. 2005; Zhang and Jarzabek 2005), a significant increase in apparent productivity is reported in two of them. In the case of actual productivity and complexity of products, the results are inconsistent. Other benefits are only reported in single studies. Of the above studies, Zhang and Jarzabek (2005) is an experience report with no discussion of significance.

  • Component origin: In component-comparison studies, systematic reuse (either verbatim, with slight modification, or mixed with new code) is related to a significant decrease in problem density in four studies (Lim 1994; Thomas et al. 1997; Mohagheghi et al. 2004; Selby 2005). Of these, Lim (1994) is an experience report where only practical significance is discussed. Systematic reuse is also related to a significant decrease in rework effort in two studies (Thomas et al. 1997; Selby 2005). Decrease in development effort per module or asset is studied in Frakes and Succi (2001), Tomer et al. (2004), Morad and Kuflik (2005) and Selby (2005), but only Selby (2005) discusses significance. Note that the definitions of the metrics vary across studies. Other significant positive impacts are verified by single studies or the results are inconsistent. Two significant negative effects are also reported by single studies: Thomas et al. (1997) report significantly higher error correction difficulty for verbatim reused code, probably because the easier errors have already been removed from the reused code and because developers have greater familiarity with newly created components. We add that it may sometimes be difficult to find the cause of a problem when a component is reused, since the problem may lie in the integration or interaction with other components. Mohagheghi et al. (2004) report significantly higher fault severity for reused components because problems in these components would lead to service unavailability or restarts. However, severity or difficulty rankings are often subjective and may not reflect the impact or importance of modifications.

  • Reuse rate: The studies of Frakes and Succi (2001) and Succi et al. (2001) are correlational; meaning that they evaluate the relation between reuse rate and the dependent variables. In the above studies, increased reuse rate is related to decrease in problem density, although not significant in Frakes and Succi (2001). Lim (1994) and Selby (2005) report some results related to reuse rate that are given under “Notes.” The increase in apparent productivity with increasing reuse rate is reported as significant in Lim (1994), while the four datasets in Frakes and Succi (2001) were inconsistent on the relation between the External Reuse Level (ERL) and apparent productivity.

We can also summarize the findings horizontally for each dependent metric, where we only summarize rows with several reported results and also give explanations from papers on why reuse is related to the outcome:

  • Defect, error or fault density is significantly reduced with the introduction of systematic reuse, as confirmed by several medium- and large-scale studies of different types, although the definitions of the metrics and whether they study pre- or post-release problems vary. Succi et al. (2001) observed that customer complaints decreased with increasing reuse rate when a domain library was in place, and not with reuse of a generic library, meaning that it is systematic reuse that had a positive impact. Reused components may be designed more thoroughly and be better tested, since faults in these components affect several products and the prevention costs are amortized over several products (Mohagheghi et al. 2004). Also, because work products are used multiple times, the accumulated defect fixes result in a higher quality work product (Lim 1994). It is interesting to notice that reuse with extensive modification does not provide the reduction in problem density that the other modes of reuse do. Selby (2005) reports that the modules reused with major revision had the highest fault correction effort, highest fault isolation effort and highest change correction effort, due to the loss of the original design and abstractions. Thomas et al. (1997) did not observe any significant difference in defect density between extensively modified components and new code, and modified components had more design errors.

  • Decrease in development effort per module or per asset is verified in four small-scale studies without discussion of significance, and in one large-scale study with significant results, i.e., Selby (2005). Selby argues that reuse leads to less effort spent in design because creating a new module requires the creation and evaluation of a new design, while reuse may only require a walkthrough of the existing design.

  • Rework effort is significantly reduced with systematic reuse. Thomas et al. (1997) evaluated rework effort per LOC, Selby (2005) per module, and Morisio et al. (2002) relative to the development effort. In the case of Selby (2005), rework effort is lowest for modules reused verbatim, but these modules are also the smallest in size. The difference is also significant for modules with slight revision. Evidence is obtained both from sister-project and component-comparison studies of different scales. The smaller number of problems or lower problem density reduces the rework effort (Thomas et al. 1997). Morisio et al. (2002) write that the more difficult tasks have probably already been performed by the framework designers.

  • Apparent productivity improves significantly with systematic reuse (Lim 1994; Morisio et al. 2002; Baldassarre et al. 2005), and the positive relation with reuse rate is reported in Lim (1994). Evidence is obtained from two small or medium-scale sister-project studies and one medium-scale component-comparison study. Because the work products have already been created, tested and documented, apparent productivity will increase. However, increased productivity does not necessarily shorten time-to-market because reuse must be used effectively on the critical path of a development project (Lim 1994).

  • Results regarding actual productivity are inconsistent. Baldassarre et al. (2005) report that actual productivity was not significantly different between the two projects developed with systematic and ad-hoc reuse. Morisio et al. (2002) report an increase in actual productivity and relate it to learning.

  • Results regarding complexity are inconsistent. Baldassarre et al. (2005) report a significant decrease in complexity with reuse and argue that without systematic reuse, a software system becomes more complex and more difficult to maintain. Zhang and Jarzabek (2005) and Morisio et al. (2002) did not observe any decrease in complexity when applying systematic reuse. Note that the definitions of the metric varied, as shown in Table 7. All of these are sister-project studies.

7.4 Qualitative Findings

Although this review focuses on studies with quantitative findings, we give an overview of a few reported qualitative findings in the papers here:

  • Reuse allows a company to use personnel more effectively because it leverages expertise (Lim 1994). More experienced personnel can be assigned to develop the reusable assets.

  • Selby (2005) reported that larger projects reuse more with modification than smaller ones since scale may motivate reuse.

  • Mohagheghi et al. (2004) reported that a reusable architecture leads to clearer abstraction of components. Reuse and standardization of software architecture and processes also allowed easier transfer of development under organizational changes (Mohagheghi and Conradi 2007).

  • Morad and Kuflik (2005) reported that reuse adoption was slower than expected and the management hesitated to assign resources to the reuse team.

7.5 Validity Threats Discussed in the Papers

The validity of a study is the degree of confidence in inferences made from the data; i.e. inferential quality. Only five papers have discussed validity threats at all: Thomas et al. (1997), Succi et al. (2001), Morisio et al. (2002), Baldassarre et al. (2005) and Mohagheghi et al. (2004). We discuss the four classes of validity threats below.

Construct validity is concerned with whether the selected metrics reflect the intervention and effects, i.e., “right metrics.” Morisio et al. (2002) discuss that size and complexity were known from earlier studies, while net rework effort was selected as a quality indicator since it is integrated in the effort model. Mohagheghi et al. (2004) discuss that fault density is used to compare the quality of components within the same environment and is widely used. In the same study, the rate of modified code between releases (code volatility) was selected since less modified code has fewer faults. Succi et al. (2001) found a high correlation between two of the metrics (External Reuse Level and External Reuse Frequency), meaning that they are not orthogonal. None of the studies have validated the selected metrics for their discriminative power, predictability or repeatability, as recommended by Schneidewind (1992). The selection of metrics is often constrained by the available data. The relation between the selected metrics and quality is not well discussed either. For example, Fenton et al. (2002) discuss that the number of detected problems is a function of both test effectiveness and potential problems, and few problems pre-release may indicate poor testing or high quality software. We notice that three of the six studies using problem density have not even discussed whether they count pre- or post-release problems.

Conclusion validity for statistical analysis is concerned with whether the relationship between intervention and outcome is of statistical significance; i.e. “right analysis.” As discussed in Section 6, the conclusion on significance is in many cases based on fixed thresholds.

Internal validity is concerned with whether the observed relation is a causal one, or whether the “right data” are collected, and is also a condition for external validity. It is difficult to discuss cause and effect without manipulation in controlled experiments and removal of confounding factors. Another view of causality, given by Mill and discussed in Gregor (2002), is that (a) the cause has to precede the effect in time, (b) the cause and effect have to be related, and (c) other explanations of the cause-effect relation have to be eliminated. In cross-sectional studies in which all the data are gathered at one time, the researcher may not even know whether the cause precedes the effect (Shadish et al. 2001). Adding comparison groups and pre-treatment observations to case studies clearly improves causal inference (ibid.). We identified three sister-project studies with some degree of control applied by the investigator (Succi et al. 2001; Morisio et al. 2002; Baldassarre et al. 2005). These studies discuss internal validity in relation to the study design, as shown under confounding factors in Appendix D. The question is whether this discussion is enough for establishing causality. Other confounding factors discussed in the papers are the impact of size (Thomas et al. 1997; Succi et al. 2001; Mohagheghi et al. 2004), complexity of modules or their interfaces (Thomas et al. 1997; Selby 2005), programming languages (Mohagheghi et al. 2004), developer skills (several studies), learning (Morisio et al. 2002) and differences in the functionality of components (Mohagheghi et al. 2004). The three experience reports (Lim 1994; Morad and Kuflik 2005; Zhang and Jarzabek 2005) and the example application (Tomer et al. 2004) do not include a discussion of confounding factors.

External validity of the results should be discussed to evaluate whether the results are generalizable to a population, to other contexts (“right context”) or to theory. The studies in this review are from industry, although the cases are not claimed to be representative. Succi et al. (2001) write that the major results described in the paper can be extended to the underlying normal population, without defining what that population is. The major limitation to external validity in Morisio et al. (2002) is discussed to be the employment of a single subject in the study, who cannot be representative of all developers. Even with valid results, the set of projects with similar size, domain, language or development methods is either not well defined or so small that generalization outside the companies is difficult, and the similarity of projects is not easy to assess. The industrial studies in this review have not been performed by taking samples from a population or selecting cases based on pre-defined criteria other than access to data. Other views of external validity should therefore be sought. Lee and Baskerville (2003) propose generalization to theories or models (an example is given in Mohagheghi et al. 2004). Another view of generalization is evaluating the weight of evidence in the context of the reader’s experience and how the results may be valuable in enhancing the evidential force to encourage a technology or approach.

In addition to the inferential quality, it is necessary to discuss data quality and reliability. Only Mohagheghi et al. (2004) discuss missing and inconsistent data, due to a problem- and change-reporting process that does not force developers to enter the necessary data. We did not find a discussion of data quality in the other papers. Pfleeger (2005) discusses some characteristics for evaluating the credibility of single studies, such as sensitivity to errors, quality and duration of observations, the expertise of those conducting and reporting studies, and their interest in the results. Yin (2003) recommends being careful to ascertain the conditions under which documents or archival records are generated and for which purpose.

7.6 Summary of the Section and Answering RQ4

We summarize the findings in several aspects:

Reuse savings

In spite of the variety of cost-benefit models (Lim 1996 compares 17 of them), we have little empirical evidence from industry on the actual economic benefits of reuse. The studies of Tomer et al. (2004) and Morad and Kuflik (2005) have compared scenarios of reuse, and Thomas et al. (1997) and Mohagheghi et al. (2004) have evaluated savings in rework effort, while Lim (1994) is an exception in presenting long-term data on savings.

Organizational impacts

Quantitative findings are rarely reported together with organizational impacts and feedback to industry. We may think of several reasons: most analyses are performed by outsiders, data were collected or analyzed after the projects had finished, and the researchers did not perform long-term studies. The three studies that have analyzed large reuse programs (Thomas et al. 1997; Mohagheghi et al. 2004; Selby 2005) have actually mined industrial data repositories. Even in the sister-project studies, it is not reported whether the better performance of a reuse-oriented process had any impact on industry decisions or development processes. Pfleeger (1996) writes that quality metrics per se, such as performance measures or defect rates, make no explicit strategic or economic statement. It is important to relate the results to the industry settings since “quality by itself is no longer a strategy that will ensure a competitive advantage. We must use quality intelligently, as one component of the overall business strategy.”

Reusable assets

When reusable assets are on the level of modules or functions, smaller and less dependent software modules are more often reused as confirmed by two papers. However, we found examples of reuse of large-grain building blocks as well, such as components in a layered architecture (Mohagheghi et al. 2004), product line architectures (Zhang and Jarzabek 2005) or OO-frameworks (Morisio et al. 2002). For large-grain components, the researchers write that reusable assets may encapsulate more difficult design; i.e., leverage of expertise.

Combining the quantitative results

We applied the vote-counting approach to combine the quantitative results, where the goal was to identify where positive or negative results are reported, and where they are significant. When it comes to productivity, significant increase in apparent productivity is verified by two sister-project studies of small or medium scale, and the study of Lim (1994). Results regarding actual productivity are few and inconsistent. Significant decrease in development effort per module, asset or product is reported in one study, while four small-scale studies have evaluated it and reported positive results, although not significant. Reuse led to significantly lower problem density and less rework effort, verified in several studies of all scales and in both sister-project and component-comparison studies. There are other benefits that are verified in single studies and a few disadvantages as well (but not as primary results of studies).

Ranking of evidence

Kitchenham (2004) ranks evidence obtained from studies in five classes, where 1 is the highest, assigned to evidence obtained from at least one properly designed randomized controlled trial, and 5 is the lowest, assigned to evidence obtained from expert opinion based on theory or consensus. We did not find examples of experiments, but when the degree of control is used to rank the evidence, three of the sister-project studies (Succi et al. 2001; Morisio et al. 2002; Baldassarre et al. 2005) would rank higher. Another step in appraising evidence is to evaluate how studies have handled validity threats. We found no analysis of metrics regarding their construct validity. Metric selection criteria were either not given or were based on the availability of data. Object selection was based on convenience or the criteria were not described, and in most of the studies the relation of the authors to the case was not discussed. We do not criticise studies for selecting objects, subjects or metrics based on convenience, but for the absence of discussion regarding the selection process. Few papers discuss validity threats and confounding factors or seek alternative explanations. The quality of the data is generally not discussed, and we conclude that there is much room for improving the design, analysis and reporting of studies.

Theory in software reuse

The need for useful and sound theories has never been emphasized more, but there are few examples of what constitutes theory in software reuse research. Gregor (2002) extends the definition of theory from explaining “why” to cover different stages in research. The goals of theory as defined by Gregor and the contributions of the review are summarized in Table 10.

Table 10 Theory in software reuse based on the review results

8 Answering RQ5—What are the Shortcomings in Reuse Research?

We discuss the question in three sections: evaluating ROI, measurement issues and other ideas for future research.

8.1 Evaluating ROI

In a recent paper by Frakes and Kang (2005) on the state of research on reuse and its future, the authors write that “much data on the effect of reuse on important variables such as cost of software production, time to market and project completion time have also been reported, though these studies tend to be quasi-experimental.” We have not found support for this claim in the reviewed literature. There may be several explanations:

  • Researchers are not interested in cost-benefit analysis. We think that the extensive body of literature and models on reuse economics rejects this explanation.

  • Companies collect little data that may be used in credible cost-benefit analysis. Possibly, once the decision on reuse is taken based on initial estimates of costs and benefits, companies do not collect data to evaluate those estimates. It may be difficult to account for investments in reuse such as infrastructure, training or making assets reusable. Another reason may be reliance on expert opinion.

  • Data are often analyzed by outsiders and not by company personnel. Outsiders have limited access to data on reuse investments, while industry either does not evaluate or does not report evaluations of the economic success or failure of reuse programs.

Evaluating the above explanations, suggesting other ones or performing realistic ROI analysis on reuse are subjects for future research. Sustainability is related to making better links between reuse and corporate strategy (Frakes and Kang 2005).

8.2 Measurement Issues

Frakes and Terry (1996) have presented a survey of reuse metrics and models and classified these in six types: cost-benefit models, maturity assessment models, amount-of-reuse metrics, failure modes models to find reuse impediments, reusability assessment models, and reuse library metrics. This review only found examples of metrics related to cost-benefit models (Table 8), the amount-of-reuse metrics (reuse rate in Table 3), and reusability assessment models (module-level characteristics in Table 7). A comparison of metrics shows several challenges:

  • Measuring reuse of assets other than code, and the effort spent on reuse: Software architecture, design, test cases and templates are reused, but their reuse rate and the effort to make them reusable or adapt them to a context are not quantified. One obvious reason is that changes in assets other than code are not measurable by tools and involve human judgement.

  • Using comparable metrics: Few studies have used comparable metrics. Future studies should define their metrics precisely, relate them to those already in use, and if possible use identical or comparable metrics.

  • Validating metrics: Metrics should be evaluated by assessing their relation to quality (quality is defined in many ways, but there is broad agreement that it comprises a collection of attributes, of which being fault-free and being delivered on time are two (Glass 1997)), their prediction value, or their power of discrimination. Such analysis needs data from several projects or over time. Using expert opinion is an alternative when such a history does not exist; see for example Li and Smidts (2003).

  • Measuring actual productivity: Actual productivity is often calculated by dividing the size of new code by total effort (see Table 5), but this does not show the productivity of a project. One solution is to define the size of developed software as the size of new code plus the equivalent size of reused code; actual productivity may then be measured by dividing the size of developed software by total effort. New code also covers glue-ware written to integrate components and add-ware written to modify components or make them reusable. COCOMO II includes a model to estimate the equivalent size of reused software depending on factors such as design and code modification rate and the understandability of the reused software (Boehm et al. 2004). Other models may be developed for the context; a minimal sketch of the calculation follows below.
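
The following sketch illustrates the calculation; the single adaptation factor used to convert reused code into equivalent new code is a simplified placeholder loosely inspired by COCOMO II, not the actual model, and all figures are invented.

```python
def equivalent_size(reused_sloc: float, adaptation_factor: float) -> float:
    """Equivalent new-code size of reused code.

    adaptation_factor is a fraction between 0 and 1 expressing how costly
    adapting the reused code is relative to writing it from scratch
    (a simplified stand-in for the COCOMO II reuse model).
    """
    return reused_sloc * adaptation_factor


def actual_productivity(new_sloc: float, reused_sloc: float,
                        adaptation_factor: float, total_effort_pm: float) -> float:
    """Size of developed software (new + equivalent reused) per person-month."""
    developed_size = new_sloc + equivalent_size(reused_sloc, adaptation_factor)
    return developed_size / total_effort_pm


# Invented figures: 20 KSLOC of new code (including glue-ware and add-ware),
# 80 KSLOC of reused code adapted at 30% of the cost of new development,
# and 100 person-months of total effort.
print(actual_productivity(20_000, 80_000, 0.3, 100))  # 440.0 SLOC/person-month
```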

8.3 Other Gaps for Future Research

Frakes and Kang (2005) propose that future research concentrate on topics such as better presentation of reusable assets, education on reuse in universities and training in industry, sustainability of reuse programs, identifying and validating metrics of reusability, and the relationship of reuse and domain engineering to newer software development processes such as agile methods. The results presented in this review highlight the following gaps for future research (other than the ROI and measurement issues discussed above):

  • Longitudinal studies over releases to validate metrics and conclusions, identify costs or break-even points, and assess organizational impacts.

  • A reuse process with integrated reuse metrics. Mohagheghi et al. (2004) discuss the lack of reuse metrics in the development process, and Baldassarre et al. (2005) present a reuse-oriented process as part of a full reuse maintenance model. Other studies do not describe the relation between reuse and software development processes.

  • Improving the state of data collection and analysis in industry. The fact that data collection in several studies required developing additional tools and restoring data shows the gap between academia and industry in collecting data. Even for problem reports, which are collected by all companies, there are major concerns regarding the quality of the data and their prediction value. In many cases data are given to researchers in formats that are not analyzable due to limitations in commercial tools. This observation confirms that collected industrial data are, to a large extent, not analyzed by industry itself (Mohagheghi et al. 2006).

  • Studying reuse of COTS and OSS components. The review mainly found examples of internal development for and with reuse. A recent large survey in Norway, Italy and Germany showed that one third of ICT companies practiced some OTS-based development (Conradi et al. 2005), and a survey of 61 ICT companies in Norway (although a non-representative sample) showed that 68% of them were using COTS or OSS components (Sommerseth 2006). Recently, a report on the defect density of over 30 widely used OSS products was published (Coverty 2006), but we do not have observations from industry.

  • Developing theories and models in addition to presenting results. We believe that this review has taken a first step by analyzing evidence and collecting explanations. Future studies should start with a theory or a model of inputs and outputs, be more explanatory regarding their results by combining quantitative evidence with qualitative observations, and discuss validity threats. They should seek multiple explanations and insights.

9 Lessons for Future Studies

Reviewing the literature from different views brings a range of issues to the forefront for improving the state of research.

9.1 Defining Context and Data to Report

One important question is how much to report on the context to allow comparison of studies. The guidelines for empirical research by Kitchenham et al. (2002) may be used in design and reporting of studies. Based on the review results, we have summarized the minimum for reporting in Table 11.

Table 11 Reporting context and data

9.2 Data Analysis

Industrial studies are mostly of the observational type. Researchers use data that are collected in industrial settings and usually have little control over the environment or the data collection procedures. Researchers often apply inferential statistical techniques to this collection of non-random data of questionable quality. There is more control in sister-project case studies, but the settings are still not artificial and do not allow controlling all variables. We discussed in Section 6 that variations of Fisher's null hypothesis testing are the dominant method for inference. Some improvements to the analysis may be suggested.

Hypotheses statement

Alternative hypotheses are stated in only a few papers, while others assume the alternative hypothesis to be a difference between means without stating it. Papers should be more explicit about this. Morisio et al. (2002) and Mohagheghi et al. (2004) are examples where the expected outcome is stated as the hypotheses of the study.

Testing hypotheses

P-values are reported in only three studies, which makes it difficult for readers to draw independent conclusions or to combine the results in a meta-analysis. Gigerenzer (2004) and Wang (1993) recommend reporting p-values rather than using the accept-reject method based on a fixed threshold level. We did not find any study that followed the experimental approach of defining sample size and type I and II errors beforehand, or that calculated statistical power (see Dybå et al. 2006). However, effect size may be discussed in some cases even without an experimental design. Effect size is the difference between mean values divided by the pooled standard deviation, and gives information on the actual observed difference.
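
As an illustration, the following is a short sketch of computing this effect size (in the Cohen's d style) from two groups of observations; the sample values are invented and not taken from any reviewed study.

```python
import statistics


def effect_size(group_a, group_b):
    """Difference of means divided by the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    pooled_sd = (((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)) ** 0.5
    return (statistics.mean(group_a) - statistics.mean(group_b)) / pooled_sd


# Invented problem densities (defects/KSLOC) for non-reused vs. reused modules.
non_reused = [4.1, 3.8, 5.0, 4.6, 4.3]
reused = [2.9, 3.1, 2.6, 3.4, 3.0]
print(round(effect_size(non_reused, reused), 2))
```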

Evaluating the results

Researchers often conclude by rejecting or not rejecting a hypothesis, as is expected of them, but the conclusion should not depend merely on p-values. Descriptive statistics and a discussion of practical significance in the settings are fundamental. One reason is that the selection of 0.05 or any other significance level is often subjective. A second reason is that practical significance is a balance between effects and costs. We found earlier that there is no golden figure for the reuse rate, and the same is true for improvements in productivity or problem density. While a 5% reduction in problem density may be considered significant in one setting, it may be considered too low in another setting relative to the investments in reuse.

9.3 Explaining the Results

We observed that few studies discussed why a result was observed or tried to establish causality by eliminating alternative explanations. As discussed in Shadish et al. (2001), it is especially important in non-experimental designs to assess plausible alternative explanations. The results should also be discussed relative to internal goals (Lim discusses that in one case the reuse rate exceeded the internal goal after a few years), previous data, feedback from industry, the authors’ experience, or impacts on a company’s reuse process and decision-making.

9.4 Evaluating Contribution for External Validity

Based on the goals of theory depicted in Table 10, reporting from single case studies can have one of the following goals:

  • Identifying commonalities and differences between settings for the purpose of analyzing and describing, and predicting the impacts for the outcome.

  • Providing new or interesting insights for the purpose of understanding or explaining.

  • Theory development or verification of theory either for explaining, predicting or action.

We may therefore ask the following questions to evaluate the contributions: Do the results strengthen or weaken our previous theories or expectations? Does a study have characteristics that make it unique for identifying new variables, gaining new insights or expecting atypical results? Does a study fill a gap or answer a question where we have little evidence, for example related to actual productivity or reuse of externally developed components? Can we draw conclusions about the theory or the outcome rather than about the population; for example, that extensively modified components will probably be more defect-prone since the original design is significantly altered?

10 Conclusion

This review examined empirical studies in industry published between 1994 and 2005. The contributions of the review are identifying the extent and type of empirical research on reuse in industry, identifying context parameters, analyzing the metrics, combining the findings, seeking for explanations and theory building, identifying gaps for future research, and suggestions for improving empirical studies in this field.

General observations

Industrial studies may assure a high degree of relevance since the settings are not artificial and the developers are professionals. On the other hand, several conditions are not controlled by the investigators. This fact does not, however, explain the low quality of much research in the field. We found several papers that lacked information on their study design, research questions and hypotheses. Generally, metrics were poorly defined and there were few discussions of the quality of data. Also, the insufficient discussion of results in many studies and the lack of attention to establishing causality create serious doubts about the validity of conclusions. Little research has been done on important questions regarding the sustainability of reuse programs and their impact on organizations and businesses. Since experimentation does not seem applicable to subjects such as software reuse, which need context and observation over a long time, we must strive to improve the quality of observational studies. Journals and conferences can play an important role here by requiring higher quality from submitted papers.

Findings

We performed the most basic form of combining empirical evidence, vote-counting, also showing when results are non-significant or when significance is not discussed. Ideally, there would be enough studies to divide them by study type. In spite of the concerns discussed above, the review found evidence for significant positive effects of reuse on:

  • Software quality: There is positive and significant evidence of lower problem density (defect, error or fault density) and less effort spent on corrections (rework effort) when systematic reuse is introduced in industry. Problem density and rework effort may have been selected because industry collects problem reports, and relates product quality to the reported problems and the effort spent on correcting them. A ranking of software engineering metrics by experts showed that problem density was among the top three measures in all phases of development (Li and Smidts 2003). The relation between the selected dependent metrics and quality needs better validation to improve construct validity.

  • Productivity: There is positive and significant evidence of apparent productivity gains in small and medium-scale studies. Increasing productivity has been one of the main motivations for reuse. The results for actual productivity are inconsistent and the definition of the metric is also problematic. Further studies are therefore necessary to evaluate productivity gains.

The scale of the studies and other characteristics such as application domain and the approach to reuse varied. This variation shows that reuse works in various situations and is practiced in multiple ways. The software industry has few standardized metrics, and comparing studies may lead to progress in this area.

One important question is to identify contexts where reuse is beneficial and how reuse should be applied to observe the benefits. Evidence collected from the studies suggests that:

  • It is verbatim reuse and reuse with slight modification that result in significantly lower problem density and development or correction effort.

  • When reusable assets are on the level of modules or functions, smaller and less complex ones may be reused more often. For large-scale reuse, the reusable assets may incorporate difficult design decisions.

  • Medium and large-scale reuse programs invest in developing the reusable assets internally.

Gaps

The intention of the review is to assist decision-makers regarding reuse investments and to help future research focus on unsolved issues. It highlights the following gaps in empirical research:

  • For researchers, the major challenges are verifying the economic returns of reuse, using comparable and consistent metrics for measuring reuse and its effects so that empirical evidence can be collected and appraised more effectively, improving analysis and statistical thinking, and improving the state of research design and reporting.

  • For industry, the major challenges are improving tools and data collection routines, evaluating reuse of COTS and OSS components, integrating reuse in software development processes, and analyzing their own data. We observed a great deal of variance among studies regarding the amount of reuse, problem density and productivity gains. It is therefore necessary to have explicit internal goals and baselines, and to link benefits to strategic or economic values. It is notable that in only three studies were the authors employees of the companies or had stated this relation clearly. The question is whether any feedback was given to industry in the other studies, which were performed by outsiders.

In addition to identifying gaps, we provided suggestions for improving reuse research based on the results of the review.

Final comments

The evidence is sparse, and we may hope that the positive trend of 2005, with four published papers, continues. We found too few empirical studies to generalize the findings in several respects. The review results are presented in several tables that increase the length of the review, but we believe the data are useful for preparing future research. Performing empirical studies in software engineering does not have a long history, and there is much to look for and learn about.