
1 Introduction

TD is one of the most recent concepts introduced in software engineering. It acknowledges the trade-off between code quality and the need to meet market expectations (e.g., low costs, short time-to-market, etc.). This is a typical situation for startup companies that have strict requirements to produce a Minimum Viable Product (MVP) to test the market and obtain the funding needed to survive. In such contexts, sub-optimal decisions that decrease the quality of the system, leading to the creation of strategic TD [72], can be a key strategy to achieve success. In any case, companies should understand that such sub-optimal decisions require additional effort to fix the product in the long run [20, 21]. In other words, creating TD can be a valuable strategy to push products onto the market, knowing that the debt needs to be paid (with interest) in the future.

This phenomenon was originally described by Ward Cunningham in 1992 [22], who introduced the concept of TD. Many further sources of TD have been investigated recently, involving communication, collaboration among team members, documentation, and individual attitudes [37, 72].

Since TD is a way of measuring the effort needed to bring a software system to top quality from its current status, being able to measure (or estimate) it is of paramount importance. The importance of such an activity is demonstrated by the simple fact that most software projects carry some TD [25]. Being able to estimate TD allows development teams and managers to plan their work properly.

It may also happen that TD is too high to be paid [18], requiring different approaches to address it (e.g., rewriting the system). However, knowing that, and how, the system reached that condition could help identify mistakes and improve the development process.

Frequent changes to software artifacts (mainly the source code) without corresponding quality assurance measures quickly lead to a decrease in software quality and an increase in the costs of further development and evolution due to the increased TD [19]. Moreover, the evaluation of TD should be performed automatically, both to avoid increasing the load on developers and to enable continuous monitoring during any phase of development. This is particularly useful in conjunction with Agile approaches: their delivery-oriented nature and continuous adaptation to customer needs make them more prone to generating TD than traditional software development. However, they are also better positioned to pay TD back through a proper implementation of refactoring.

For all these reasons, being able to measure TD automatically is of paramount importance to support the daily work of developers. The literature offers many different approaches to TD; this paper provides an extensive analysis of the current state of the research, extending the same authors' previous work [33]. In particular, we have enhanced the analysis by including a wider number of primary studies.

The paper is organized as follows: Sect. 2 describes the adopted methodology; Sect. 3 discusses the findings; Sect. 4 investigates the related work; Sect. 5 analyzes the threats to validity; finally, Sect. 6 draws the conclusions and introduces future work.

2 Methodology

The protocol adopted for this Systematic Literature Review (SLR) is the one introduced by Kitchenham and Charters [34] for performing such reviews in the software engineering area.

The main goal of this work is to review the existing studies and highlight the aspects related to TD measurement; therefore, we have defined the following research questions:

  • RQ1: What are the existing techniques for measuring TD?

  • RQ2: What are the tools that support the automation of the measurement of TD?

  • RQ3: Are there any empirical studies able to demonstrate the usefulness of the identified techniques?

  • RQ4: Are there any empirical studies able to demonstrate the usefulness of the tools identified?

To answer the research questions, we searched for papers using three of the largest digital libraries: ACM Digital Library, IEEE Xplore, and Google Scholar.

Since only studies focusing on TD as the main topic are of interest for our purpose, we assume that their title or abstract includes the keywords technical debt measurement. Consequently, we used an appropriate query for each library:

  • ACM Digital Library: (+technical +debt +measurement) OR recordAbstract:(+technical +debt +measurement)

  • IEEE Xplore: ((“Document Title”:technical debt measurement) OR “Abstract”: technical debt measurement)

  • Google Scholar: “technical debt measurement”

The data were extracted in two stages: in August 2018, when the initial version of the study started, and in September 2019, to extend the study with the latest available research.

Only papers meeting the following criteria were included in the final result: they contain an abstract, consider TD as the main topic, and are written in English. No constraint on the publication year was specified, since we aimed at collecting all appropriate data regardless of date.

Many publications found in the digital libraries were not appropriate for our study, since we were interested in primary studies published in refereed workshops, conferences, and journals. Therefore, we excluded documents such as summaries of workshops, tutorials, introductory descriptions of conferences, research plans, presentations, non-primary studies, and technical reports; in short, all documents that were not proper research papers.

Finally, we manually excluded all papers unrelated to our research that had passed the previous filters but were still included in the list. This selection was performed after reading the entire content of the papers.

3 Results

We found 1,063 papers distributed as follows: ACM Digital Library (211), IEEE Xplore (317), and Google Scholar (535).

As expected, there was a significant overlap among the papers found in the different libraries. Therefore, the first step was merging the results and removing duplicates. At the end of the process, we selected 46 papers. The overall selection process is summarized in Fig. 1 (the numbers on the arrows show the number of papers that passed each phase):

Fig. 1. Steps of the selection process.

  • Step 1: Merging All Papers from Data Sources. The initial list included 1,063 papers but many duplicates were present. The identification of the duplicates was performed manually to avoid problems with minor character differences in the titles and in the author names. At the end, we had a list of 835 unique papers.

  • Step 2: Applying Exclusion Criteria. We applied the exclusion criteria, resulting in a selection of 524 papers. At this stage, secondary studies were still kept in the list.

  • Step 3: Excluding Non-Primary Studies. At this stage, we identified the secondary studies (e.g., systematic reviews, systematic mappings, etc.), which were removed from the list and are analyzed in Sect. 4. We identified 10 secondary studies, and the list was reduced to 452 papers.

  • Step 4: Considering Studies Related to the Measurement of TD. Reading the title and the abstract of the 452 papers, we identified the studies related to the measurement of TD: 38 papers distributed between 2011 and 2019, as described in Fig. 2.

  • Step 5: Quality Assessment. We read the 77 papers identified and excluded 39 of them, since they did not actually deal with the measurement of TD even though, from the title or the abstract, they appeared appropriate for our investigation.

Fig. 2. Distribution of papers related to TD measurement over the years.

3.1 RQ1: What Are the Existing Techniques for Measuring TD?

The identified studies have been analyzed in terms of the proposed techniques, the input data required for the calculation of TD, the resulting information, and the advantages and disadvantages of each approach. Table 5 summarizes the identified techniques, while Table 1 compares the input required by the different techniques and Table 2 the output generated.

Table 1. Input of TD measurement techniques.

Letouzey [40] proposed a method for TD evaluation named Software Quality Assessment Based on Lifecycle Expectations (SQALE), which is described as an answer to the need for an objective, standardized, open-source method with few false positives. The official website of the method lists several tools able to analyze code written in different languages.

The method defines how to formulate and organize the non-functional requirements that can affect code quality, defining a hierarchical structure of characteristics and sub-characteristics similar to the ISO quality model. SQALE has been designed to be automated and considers several properties of the code, but two main aspects are not taken into account. The first is that non-conformities relevant to business or operations are not considered important by any SQALE index (as of version 1.0 [41]). The second is that there is no definition of the level of implementation of the requirements.

CAST [23] presents a formula with flexible parameters to measure TD. This flexibility allows the parameters to be adjusted to the specifics of a particular organization. The approach defines five Health Factors with different impacts on the overall TD: Changeability (30%), Transferability (40%), Robustness (18%), Security (7%), and Performance Efficiency (5%).

Violations in each area are rated according to their severity, and a formula is applied to calculate the final value of the debt. The approach has been evaluated on 745 business applications of more than 10 KLOC each, using the CAST proprietary Application Intelligence Platform.
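To make the weighting concrete, the sketch below combines per-factor violation counts using the percentages reported above. Since the actual CAST formula is not reproduced here, the severity weights, per-violation effort, and hourly rate are purely illustrative assumptions.

```python
# Illustrative sketch only: the exact CAST formula is not reproduced here.
# Violation counts per Health Factor are combined using the weights reported
# above; severity weights, effort, and rate are assumed values.

HEALTH_FACTOR_WEIGHTS = {
    "changeability": 0.30,
    "transferability": 0.40,
    "robustness": 0.18,
    "security": 0.07,
    "performance_efficiency": 0.05,
}
SEVERITY_WEIGHTS = {"low": 1.0, "medium": 2.0, "high": 4.0}  # assumed scale
HOURS_PER_VIOLATION = 0.5  # assumed average remediation effort (hours)
HOURLY_RATE = 75.0         # assumed cost of one hour of work

def technical_debt(violations):
    """violations: dict mapping (health_factor, severity) -> count."""
    weighted = sum(
        count * HEALTH_FACTOR_WEIGHTS[factor] * SEVERITY_WEIGHTS[severity]
        for (factor, severity), count in violations.items()
    )
    return weighted * HOURS_PER_VIOLATION * HOURLY_RATE  # debt in currency units

print(technical_debt({("security", "high"): 12, ("robustness", "low"): 140}))
```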

The SIG/TUViT approach [52] is based on a sound, quantitative method for measuring software quality from source code. Moreover, the estimation of TD is based on empirical data, using a rather simple model.

Mayr et al. [49] define a model that combines the benefits of flexible approaches to quality changes with the simplicity of the SIG model. The approach requires only information from static code analysis. The output is equally simple: the hours of work required to pay off the debt.

Skourletopoulos et al. [68] developed a fluctuation-based modelling approach to TD. It measures the amount of profit not earned due to the under-usage of a given service, also considering the probability of over-usage, which would lead to accumulated TD. The hypothesis is that service capacity affects service choice, which is made with respect to the predicted fluctuations in the number of users over time and the way TD is gradually paid off. Consequently, formulas for predicting the appearance of TD were developed, as well as tools for validating them.

Chatzigeorgiou et al. [18] provide an estimation of a breaking point, i.e., the moment when the debt becomes too large to be paid off. The source code is initially assessed by a fitness function based on the Entity Placement metric, which quantifies coupling and cohesion. The approach is based on the identification of the best design for the system. The cost of reaching that best design with the necessary refactorings is calculated, as well as the number of versions leading to the breaking point (a minimal sketch of this idea is given after the list below). However, the authors point out some issues to be considered:

  • the method considers only the coupling and cohesion dimensions, whereas TD has many other aspects

  • maintenance effort means not just adding lines of code, but deleting and modifying them

  • future maintenance effort cannot be predicted solely on the basis of past maintenance tasks
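As a rough illustration of the breaking-point concept only (not the authors' actual algorithm, which searches the design space through the Entity Placement metric), the following sketch finds the first version at which the accumulated interest of carrying the debt exceeds the one-off repayment cost; all figures are hypothetical.

```python
# Not the authors' algorithm: a toy reading of the "breaking point" as the
# first version where the accumulated extra maintenance effort (interest)
# exceeds the one-off cost of refactoring to the best design (principal).

def breaking_point(principal, interest_per_version, max_versions=1000):
    accumulated_interest = 0.0
    for version in range(1, max_versions + 1):
        accumulated_interest += interest_per_version
        if accumulated_interest > principal:
            return version  # carrying the debt is now costlier than repaying it
    return None  # no breaking point within the horizon

# Hypothetical figures: refactoring costs 400 h; the current design costs an
# extra 25 h of maintenance per version.
print(breaking_point(principal=400, interest_per_version=25))  # -> 17
```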

Kamei et al. [32] propose measuring the interest on self-admitted TD with code metrics such as LOC (which correlates well with code complexity metrics) and Fan-In (which shows how much one piece of code affects others). They validated the approach on the Apache JMeter project.
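As an illustration of the Fan-In metric mentioned above, the following minimal sketch computes it over a toy call graph (caller to callees); the graph itself is hypothetical.

```python
# A small sketch of the Fan-In idea: given a static call graph mapping each
# caller to its callees, the Fan-In of a module is the number of distinct
# modules that call it. The toy graph below is hypothetical.
from collections import defaultdict

def fan_in(call_graph):
    callers_of = defaultdict(set)
    for caller, callees in call_graph.items():
        for callee in callees:
            if callee != caller:
                callers_of[callee].add(caller)
    return {module: len(callers) for module, callers in callers_of.items()}

graph = {"A": ["C"], "B": ["C", "D"], "C": ["D"]}
print(fan_in(graph))  # {'C': 2, 'D': 2}: C and D are each used by two modules
```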

Marinescu [46] proposes a framework exploring TD symptoms at the design level. The construction of the framework includes four steps:

  1. definition of the principles for finding design defects

  2. identification of a set of relevant design defects

  3. estimation of the impact of each defect

  4. calculation of the overall design quality

The framework also includes:

  • a coarse-grain approach to monitor the evolution of TD over time

  • a more detailed approach that enables locating and understanding individual flaws, which can lead to a systematic refactoring

The approach has been applied in a case study including 63 releases of two well-known Eclipse projects (JDT and EMF). However, the conclusions of the case study cannot be generalized, considering the restricted number of systems analyzed and the limited number of design flaws included in the actual instantiation of the framework.

In the framework proposed by Singh et al. [66], the TD estimation is based on measures of code maintainability obtained via static analysis, while the interest estimation is based on activity data obtained by monitoring developer actions in the IDE. The main contribution of the framework is the integration of developer activity data with code metrics, which improves the understanding of developer comprehension effort and results in a more accurate estimation.

Table 2. Output of TD measurement techniques.

Although Architectural Technical Debt (ATD) is difficult to measure, the Average Number of Modified Components per Commit (ANMCC) is a metric proposed in [43] to approximate it. However, commit records may no longer exist; therefore, the authors suggest using the Index of Package Changing Impact (IPCI) and the Index of Package Goal Focus (IPGF) instead of ANMCC. The advantage of these two new metrics is that they can be obtained directly from the source code. The correlation of the new metrics with ANMCC is then validated. However, a weakness of the whole study is that it relies only on results from projects developed in C#.
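The definition of ANMCC translates directly into code. The sketch below computes it from a list of commits, assuming (as a simplification) that a file's top-level directory identifies its component; the component mapping is project-specific in practice.

```python
# A direct reading of the ANMCC definition: the average number of distinct
# components modified per commit. Mapping a file to its component is
# project-specific; here we assume the top-level directory is the component.

def anmcc(commits):
    """commits: list of commits, each given as a list of modified file paths."""
    if not commits:
        return 0.0
    components_per_commit = [
        len({path.split("/")[0] for path in files}) for files in commits
    ]
    return sum(components_per_commit) / len(commits)

history = [
    ["core/a.cs", "core/b.cs"],              # touches 1 component
    ["core/a.cs", "ui/form.cs", "db/x.cs"],  # touches 3 components
]
print(anmcc(history))  # -> 2.0
```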

Martini et al. [47] conducted a multiple embedded case study at seven sites of five large companies to investigate the current causes of the accumulation of ATD. The authors investigated two research questions: (i) which factors cause the accumulation of ATD, and (ii) what the current trends in practice are in the accumulation and recovery of ATD over time. They provided a taxonomy of causes and their influence on the accumulation of ATD.

Maldonado et al. [45] examined code comments to identify and evaluate Self-Admitted Technical Debt (SATD). The strength of the approach is the usage of heuristics to eliminate comments that are not likely to reflect TD. In addition, the method classifies comments into different types of SATD.

Besker et al. [12] show critical results for ATD management, such as the fact that monitoring and evaluating ATD using accurate metrics is a key issue that is not fully supported by any currently available tool.

Flisar and Podgorelec [26] developed a new SATD identification method that takes advantage of a large corpus of unlabeled code comments. The proposed feature enhancement method was used with the three most common feature selection methods (CHI, IG, and MI) and three well-known text classification algorithms (NB, SVM, and ME). It was tested on ten open-source projects, achieving 82% correct predictions of SATD. The proposed method appears to be a good candidate for adoption in practice.

Lenarduzzi et al. [38] applied the SZZ algorithm to label fault-inducing commits and used eight machine-learning techniques (Linear Regression, Random Forest, Gradient Boost, Extra Trees, Decision Trees, Bagging, AdaBoost, and SVM) to show that the accuracy of TD can be improved. The authors found that, among the 202 violations defined for Java by SonarQube, only 26 have a relatively low fault-proneness.
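A hedged sketch of this kind of setup is shown below: several scikit-learn classifiers predict SZZ-labelled fault-inducing commits from per-commit violation counts. The feature matrix is random placeholder data, not the study's dataset, and Linear Regression is replaced by logistic regression since the target here is binary.

```python
# Placeholder data, not the study's dataset: each row holds per-commit counts
# of 26 SonarQube violation types; y marks SZZ-labelled fault-inducing commits.
import numpy as np
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              ExtraTreesClassifier, GradientBoostingClassifier,
                              RandomForestClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.integers(0, 5, size=(200, 26))  # toy violation counts per commit
y = rng.integers(0, 2, size=200)        # 1 = fault-inducing (SZZ label)

models = {
    "LogisticRegression": LogisticRegression(max_iter=1000),
    "RandomForest": RandomForestClassifier(),
    "GradientBoost": GradientBoostingClassifier(),
    "ExtraTrees": ExtraTreesClassifier(),
    "DecisionTree": DecisionTreeClassifier(),
    "Bagging": BaggingClassifier(),
    "AdaBoost": AdaBoostClassifier(),
    "SVM": SVC(),
}
for name, model in models.items():
    print(name, cross_val_score(model, X, y, cv=5).mean())
```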

Pecorelli et al. [56] reported on a large-scale empirical comparison of five different balancing techniques for ML-based code smell detection. The results suggest that ML models relying on SMOTE (Synthetic Minority Over-sampling Technique) achieve the best performance; however, its training phase is not always feasible in practice. Furthermore, avoiding balancing altogether does not dramatically impact performance. Existing data balancing techniques are therefore inadequate for code smell detection, which hinders the feasibility of the current ML-based approaches.
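For context, this is how SMOTE-based balancing is typically applied before training a smell detector; the sketch below uses the imbalanced-learn library with toy data and is not the authors' experimental pipeline.

```python
# Toy data, not the authors' pipeline: a heavily imbalanced smell dataset is
# balanced with SMOTE before training a classifier.
import numpy as np
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)
X = rng.random((300, 10))           # code metrics per class (toy values)
y = np.array([1] * 30 + [0] * 270)  # only 10% smelly: imbalanced

X_res, y_res = SMOTE(random_state=42).fit_resample(X, y)
print(np.bincount(y), "->", np.bincount(y_res))  # minority class oversampled

clf = RandomForestClassifier().fit(X_res, y_res)
```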

Capitan and Vogel-Heuser [16] proposed metrics for identifying TD based on the IEC 61131-3 programming languages, adapting established metrics (e.g., Halstead's and McCabe's) to those languages and their cyclic processing.

Kumar et al. [35] proposed a novel approach for identifying TD in service composition in SaaS clouds. The approach combines time-series forecasting with a newly proposed TD model to estimate the future debt and utility of the service composition. Through a real-world case study, they demonstrate that the approach can successfully identify both good and bad debts while producing satisfactory accuracy in estimating the TD.

Ciolkowski et al. [19] developed a prototype and a prediction model for forecasting potential savings based on proposed refactoring of key drivers of TD identified by the machine-learning model.

Verdecchia et al. [76] presented a novel approach to identify ATD of Android apps based on architectural guidelines extraction and modeling, architecture reverse engineering, and compliance checking.

Lavazza et al. [36] proposed a formal, executable model that supports the simulation of various scenarios in time-boxed software development and maintenance processes. The model can be used to show the effects that TD has on relevant issues such as productivity and quality, depending on how TD is managed, with special reference to how much effort is dedicated to TD repayment and when such effort is allocated.
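The following toy simulation, inspired by (but much simpler than) such models, shows the kind of question they answer: how the share of effort dedicated to TD repayment affects delivered features over time-boxed iterations. All parameters are invented for illustration.

```python
# Invented parameters, for illustration only: each time-boxed iteration splits
# the available effort between TD repayment and feature work, while the
# remaining debt accrues interest and drags down feature productivity.

def simulate(iterations, effort=100.0, repay_share=0.2,
             debt=200.0, interest_rate=0.05, drag=0.001):
    delivered = 0.0
    for _ in range(iterations):
        debt *= 1 + interest_rate                # interest accrues on the debt
        repayment = min(effort * repay_share, debt)
        debt -= repayment                        # part of the effort repays TD
        productive = (effort - repayment) * max(0.0, 1 - drag * debt)
        delivered += productive                  # debt slows feature delivery
    return delivered, debt

for share in (0.0, 0.2, 0.4):
    features, remaining = simulate(20, repay_share=share)
    print(f"repay {share:.0%}: features={features:.0f}, debt left={remaining:.0f}")
```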

TD visualizations have been designed to improve stakeholder communication and support business decision-making at different levels of the organization. [53] concluded that TD visualization contributes to improving communication in the decision-making processes associated with the software lifecycle.

[61] addresses the problem of SATD (Self-Admitted Technical Debt) classification using a Convolutional Neural Network that takes source code comments as input and predicts whether each comment is an SATD comment or not.
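A minimal sketch of such a text-CNN classifier is shown below (not the architecture of [61]); the vocabulary size, comment length, and training data are placeholder assumptions.

```python
# Placeholder architecture and data: a 1-D convolutional classifier over
# integer-encoded code comments, predicting P(comment is SATD).
import numpy as np
from tensorflow.keras import layers, models

VOCAB_SIZE = 5000  # assumed tokenizer vocabulary size
MAX_LEN = 50       # assumed maximum comment length in tokens

model = models.Sequential([
    layers.Embedding(VOCAB_SIZE, 64),          # token ids -> dense vectors
    layers.Conv1D(128, 5, activation="relu"),  # n-gram-like local features
    layers.GlobalMaxPooling1D(),               # strongest feature per filter
    layers.Dense(64, activation="relu"),
    layers.Dense(1, activation="sigmoid"),     # P(comment is SATD)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# x: integer-encoded comments, y: 0/1 SATD labels (both stand-ins here)
x = np.random.randint(0, VOCAB_SIZE, size=(32, MAX_LEN))
y = np.random.randint(0, 2, size=(32,))
model.fit(x, y, epochs=1, verbose=0)
```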

Tsintzira et al. [74] used an established method for quantifying TD, namely FITTED, to measure the TD of an industrial software product and compare it to the perception of the software engineers.

3.2 RQ2: What Are the Tools That Support the Automation of the Measurement of TD?

TD measurement techniques often need a large amount of input data, whose extraction requires considerable effort. Therefore, tools are of paramount importance to support development teams in integrating TD measurement into their daily work. Table 3 provides a summary of the available tools and the methodology each implements.

Table 3. Tools able to support the automation of the measurement of TD.

SonarQube [27] implements the SQALE method of TD evaluation. It is used for continuous inspection of code quality, performing automatic reviews based on static code analysis to detect bugs, code smells, and security vulnerabilities in several programming languages.
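For illustration, SQALE-based TD figures can be retrieved programmatically from a SonarQube server through its web API, as sketched below; the sqale_index metric (remediation effort in minutes) and related keys exist in recent SonarQube versions, while the server URL, project key, and token are placeholders.

```python
# The endpoint and metric keys below exist in recent SonarQube versions;
# the server URL, project key, and token are placeholders.
import requests

SONAR_URL = "https://sonarqube.example.com"  # hypothetical instance
PROJECT_KEY = "my-project"                   # hypothetical project key
TOKEN = "squ_..."                            # user token (placeholder)

resp = requests.get(
    f"{SONAR_URL}/api/measures/component",
    params={
        "component": PROJECT_KEY,
        # sqale_index = remediation effort in minutes; the other two keys
        # relate it to the estimated cost of developing the system
        "metricKeys": "sqale_index,sqale_debt_ratio,sqale_rating",
    },
    auth=(TOKEN, ""),  # SonarQube accepts the token as the basic-auth user
)
resp.raise_for_status()
for measure in resp.json()["component"]["measures"]:
    print(measure["metric"], "=", measure["value"])
```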

MIND (ManagIng techNical Debt) is an open-source tool that is, to the best of our knowledge, the first to support the quantification and visualization of the interest [24]. It is essentially a plug-in for SonarQube. MIND uses a set of metrics to compute the interest:

  • Defect Proneness

  • Maximum Defects per 100 LOC Touched

  • Extra Defect Proneness

  • Maximum Extra Defects per 100 LOC Touched

  • Relative Extra Defect Proneness

  • Average Relative Extra Defect Proneness

  • Violation Density

  • Linkage

  • Estimation Error

JCaliper [18] was designed to find the placement of entities that minimizes the Entity Placement metric as a search-space exploration problem. It automatically extracts the number, type and sequence of refactoring activities required to obtain the design without TD.

Blaze is a monitoring tool [69] that records the temporal sequence of developer actions, including code navigation and edit actions. The log produced is subsequently analyzed to derive class relationships and the effort spent by a developer to understand program elements.

TortoiseSVN allows extracting commit records from standard SVN servers and any code repository supporting Subversion, such as GitHub. These records are used by Li et al. [43] to compute the ANMCC metric.

JDeodorant [73] is used in [32] to parse the source code; in particular, its ability to extract a comment and map it to its corresponding method is noteworthy. Later in the paper, to calculate the interest incurred over time, 16 code metrics are extracted using the Understand tool [1]. JDeodorant [73] is also used in [45] to parse the source code and extract the code comments; before that, the SLOCCount tool [77] is applied to calculate the SLOC of the Java files.

EXA2PRO [70] is a programming environment that integrates a set of tools and methodologies allowing many exascale computing challenges to be addressed systematically, including performance, portability, programmability, abstraction and reusability, fault tolerance, and TD.

[60] presented a process framework for managing TD in commercial software product development. The framework integrates the processes required for TD management with the existing software quality management processes prescribed by the Project Management Body of Knowledge (PMBOK, https://www.pmi.org/pmbok-guide-standards), and organizes the different processes for TD management in three steps: (1) make TD visible, (2) perform cost-benefit analysis, and (3) control TD. To implement the processes, the authors introduced a new artifact, the TD register, which stores the principal and the associated interest estimated for the TD related to an asset.
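The paper does not prescribe a concrete schema for the TD register, but a minimal sketch of the idea could look as follows; all field names and the cost-benefit rule are assumptions made for illustration.

```python
# Hypothetical schema: one register entry per asset, storing the estimated
# principal and interest; field names and the cost-benefit rule are assumed.
from dataclasses import dataclass, field

@dataclass
class TDItem:
    asset: str           # e.g., a component or module
    description: str
    principal_h: float   # estimated repayment effort (hours)
    interest_h: float    # estimated extra effort per iteration while unpaid

@dataclass
class TDRegister:
    items: list = field(default_factory=list)

    def add(self, item: TDItem) -> None:
        self.items.append(item)  # step 1: make TD visible

    def worth_repaying(self, horizon: int) -> list:
        # step 2: a simple cost-benefit check over a planning horizon
        return [i for i in self.items if i.interest_h * horizon > i.principal_h]

reg = TDRegister()
reg.add(TDItem("billing-service", "hand-rolled date parsing", 40.0, 3.0))
print([i.asset for i in reg.worth_repaying(horizon=20)])  # -> ['billing-service']
```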

[7] introduced debtgrep, a tool to prevent the growth of dependency violations, violations of naming conventions, usage of deprecated APIs, and other kinds of mostly invisible TD. The authors provide some specific examples of use cases for debtgrep.

[67] introduced a tool used for extracting coupling and cohesion metrics at package level to study their impact on TD. The dataset of their study consisted of approximately 1,200 software packages.

[19] introduced the ProDebt tool, a methodology and a software tool to support the strategic planning of TD in the context of agile software development.

[6] proposed a new index for the evaluation of architectural issues in the form of Architectural Smells (AS) and developed a tool to detect AS in Java projects. They focused on AS based on dependency issues, since components that are highly coupled and have a high number of dependencies cost more to maintain and can be considered more critical.

[13] introduced an open-source tool for the automatic detection of architectural smells in C/C++ projects; it creates an abstraction of the project and defines the concept of dependency between project elements in order to identify architectural smells.

[48] developed a holistic framework for the semi-automated identification and estimation of ATD in the form of non-modularized components.

[14] presented a static code analysis tool and its usage for the identification of TD in IEC 61131-3 software. The tool supports both bottom-up analysis (studying the metric values of individual modules and convention violations) and top-down analysis (studying the call graphs). In addition, the authors provide an extra, horizontal analysis by comparing the metric results across different demonstrators.

[2] presented Tracy, a decision-making framework that prioritizes TD considering how IT assets support a company’s business processes, thus providing a new perspective on TD management.

[50] presented VisminerTD, a tool that allows the automatic identification and interactive monitoring of the evolution of TD items by combining software metrics, code comment analysis, and information visualization. The results provided evidence on the use of the proposed tool, indicating (i) that it can be useful in supporting TD identification and TD monitoring activities and (ii) that it can bring gains in terms of comprehensiveness and efficacy when evaluating the desirable time to identify and monitor different types of debt.

[71] found that the tools in use cannot help in identifying many important TD types, so involving humans is necessary. Tools can help identify TD faster or more accurately; however, project priorities and current development activities need to be considered together with the values of principal and interest when deciding whether to pay TD off, in order to provide a comprehensive evaluation.

3.3 RQ3: Are There Any Empirical Studies Able to Demonstrate the Usefulness of the Identified Techniques?

The empirical studies performed to validate the identified techniques are summarized in Table 4.

Table 4. Identified techniques and the related empirical studies.

[28] assessed three methods [23, 40, 46] to find out if they effectively describe the relationship between the quality of the system and the level of TD.

Izurieta et al. [31] use the work of Nugroho et al. [52] to exemplify their methodology.

The benchmarking-based model of Mayr et al. [49] is closely related to their earlier work on benchmarking-oriented quality assessments. It also calculates the remediation cost in a way similar to the CAST approach [23].

The code structure metrics in the framework for estimating the interest on TD [66] were selected based on their relationship to maintainability and TD established in [52]. As in that prior work, static code metrics are used.

[39] conducted an empirical study on 21 well-known mature open-source projects to confirm the hypothesis about the fault-proneness of the SonarQube violations.

[78] selected four different TD identification techniques (code smells, automatic static analysis (ASA) issues, grime buildup, and modularity violations) and applied them to 13 versions of the Apache Hadoop open-source project. The authors showed that the different TD techniques are loosely coupled and therefore indicate problems in different locations of the source code. Moreover, their proxy interest indicators (change- and defect-proneness) correlate with only a small subset of TD indicators.

[64] surveyed the empirical research work on the emerging topic of SATD published after 2014 and until the compilation of their survey in July 2018. They compiled the tools and datasets that can be used as a foundation to motivate and facilitate novel and improved approaches for managing and, ultimately, repaying SATD. At the same time, the authors observed a lack of studies focusing on the repayment and management of SATD, which is of critical importance.

3.4 RQ4: Are There Any Empirical Studies Able to Demonstrate the Usefulness of the Tools Identified?

In [54], TD was measured using two static code analysis tools (FindBugs [8] and SonarQube [27]). The goal was to evaluate whether code produced with the Test-Driven Development approach has a lower TD than code produced using other techniques. These two tools are widely used in the community for measuring TD.

Other studies tested SonarQube: [44] used it to measure TD in a particle tracker system; [51] used it for several calculations of TD in the software supply chain; [15] describes a case study at Ericsson, where TD measurement tools were examined for use in an evaluation system based on the ISO/IEC 15939:2007 standard.

4 Related Work

Investigating the different approaches for measuring TD could be valuable to practitioners and researchers to provide a better understanding of the field and identify research gaps. However, we were not able to identify any secondary study related to the research questions we listed in Sect. 2. Instead, several others deal with TD in general.

The systematic mapping study of Li et al. [42] was initiated to find and analyze publications on TD and its management between 1992 and 2013. After selecting 92 studies, the authors classified 10 TD definitions, identified 8 TD management activities, and collected 29 tools supporting them.

In another systematic mapping study of TD definitions, Poliakov [58] performed a full review of 159 papers, in which 107 definitions were broken down into keywords. The main outcome of the research is the resulting keyword map, supplemented with synonyms and types of TD.

Another literature review, by Alves et al. [3], was based on three research questions. They evaluated 100 studies from 2010–2014 and proposed an initial taxonomy of TD types, a list of indicators for identifying TD, and existing management strategies.

Another study considers a different aspect of the phenomenon: Ribeiro et al. [62] state that evaluating the appropriate time to pay TD and applying effective decision-making criteria are important management goals. Consequently, the authors identified 14 such criteria for development teams. The results also revealed gaps where further research can be performed.

Recently, Behutiye et al. [9] considered a narrower field of study related to TD: they synthesized the state of the art of TD and its causes, consequences, and management strategies in the specific context of agile software development (ASD). In their systematic literature review, 38 primary studies were identified out of 346 and analyzed. Five research areas of interest related to the literature on TD in ASD were found, as well as 12 strategies for managing it. The authors also identified eight categories of causes and five categories of consequences of incurring TD in ASD.

In the work by Besker et al. [11], ATD is considered to affect system success and to potentially cause expensive repercussions, so the goal is to create new knowledge about ATD by synthesizing and compiling existing research efforts. The main contribution of the paper is a novel descriptive model providing a comprehensive interpretation of the ATD phenomenon.

Finally, the last related work focuses on a specific view of TD. Employing a method for systematic literature review and applying it to seven digital library sources, Ampatzoglou et al. [5] analyzed the financial aspects of TD. The authors conclude that communication between technical managers and project managers is beneficial, since it provides a shared vocabulary and allows high-quality goals to be set. To this end, they introduced a glossary of terms and a classification scheme for financial approaches.

[63] investigates the current state of TD research based on 13 secondary studies, dated from 2012 to March 2018; the work draws several interesting conclusions, for instance about the coverage of areas (code, test, process, etc.).

[75] investigated the state of the art and examined the major contributions made in the field of TD estimation and forecasting. The authors state that the existing methods and tools for TD estimation have not yet reached a satisfactory level of maturity, while there is still a large volume of potential metrics and techniques that have not been used and that could increase the completeness of TD estimation. In addition, although there has been extensive research on predicting the evolution of individual software features, quality attributes, and quality properties directly or indirectly related to the TD of a software project, no concrete contributions exist in the literature regarding TD forecasting.

[48] proposed a Strategic Adoption Model for Tracking Technical Debt (SAMTTD) aimed at helping companies assess their TD management process and make decisions on its improvement.

[10] performed a systematic mapping study to identify and analyze the empirical studies on TD published between 2014 and 2017. The authors presented the most common indicators used to identify and evaluate TD and identified thirteen types of TD. They also identified forty-eight tools from the selected empirical studies, finding that some studies use more than three tools to investigate TD, while others develop new tools and compare their results to open ones. Special attention was paid to SATD detected through code comments and to smells, as the most frequently applied indicators of TD.

5 Threats to Validity

The main threats to validity identified are the following:

  • Although the applied guideline [34] recommends considering about seven digital libraries to perform an exhaustive search, in our case only three were chosen. The reason is that other sources contain very few unique papers compared to the ACM and IEEE digital libraries. Moreover, to avoid missing important papers, we used Google Scholar, which indexes almost everything.

  • Constructing an appropriate search string is a tricky task. Since the titles of some studies of interest do not include our keywords, we decided to extend the search to the abstracts, assuming that studies focusing on TD mention the keywords there.

  • Automatically merging the result lists from the libraries is risky, since even a single differing character in a title might affect the outcome. For that reason, duplicates were identified and eliminated manually during the creation of the merged list.

  • Some information may not have been considered in our study, since some papers could have been accidentally skipped or not yet indexed at the time of the query (September 2019).

6 Conclusions and Future Work

TD is a widely used buzzword, but gaining a clear understanding of the available approaches and tools is quite difficult due to the large amount of material spread across many sources. This paper aimed at providing researchers and practitioners with an overview of the state of the art of TD, focusing on automated approaches.

According to the review, the research area is new and very active, but still not mature. New approaches and tools constantly appear that are not based on the outcomes of previous studies, and researchers focus on validating their own approaches without independent assessments. Moreover, such validations are frequently not replicable due to the usage of proprietary datasets. Therefore, additional effort is needed to identify cross-validated approaches with clear indications of their applicability. This is especially important for practitioners, since it is difficult for them to identify the models to apply in their specific contexts.

The study has also pointed out that, where tools are available to support specific approaches, they are often difficult to use, requiring a complex setup and providing limited support for the wide range of programming languages used in real projects. Moreover, most of the available tools are not able to measure or estimate the overall TD: they usually focus on the remediation costs and do not take into consideration the related interest (often called non-remediation cost), which is very important for planning the development process and keeping the debt under control over the entire lifecycle of a product.