1 Introduction

Code clones are code fragments that are similar to one another in syntax and semantics [1]. During software development, a fragment is often copied from one location and pasted into another, with or without modifications. This activity causes multiple copies of exact or closely similar code fragments to coexist in software systems. These code fragments are known as clones [2]. In most cases, cloned code is harmful to software maintenance and evolution [3–8]. Although code cloning helps developers quickly reuse existing design and implementation, it also significantly increases development and maintenance costs, because programmers must apply repetitive edits when the common logic among clones changes. Furthermore, failing to apply those changes can result in defects and field failures. On the other hand, a good deal of empirical evidence favors clones and concludes that they are not harmful [9–12].

Refactoring improves code structure without changing program behavior [13]. Fowler introduced many refactoring techniques in his book, which is widely read by practitioners [14]. One of the most frequently performed refactoring techniques is “Extract Method,” which means extracting one part of an existing method as a new method and replacing the extracted part with a procedure call [15]. This technique, a common way of reducing repetition in code, is also known as “extract function” or “extract procedure.” Refactoring tools in commonly used IDEs, such as Eclipse, support procedure extraction to a certain degree to help programmers deal with this common and recurring situation.

Refactoring is widely used to delay the degradation effects of software aging and to facilitate software maintenance [16]. However, the output of a clone detection tool cannot be refactored directly, because not all code clones detected by such a tool are appropriate for refactoring [17]. So far, no study has described a method for eliminating false positives among cloned code-related bugs. The chief contribution of this paper is as follows: a metric-based method is developed to identify clone groups that are suitable for refactoring.

The rest of the paper is organized as follows: Sects. 2 and 3 provide the background and the clone analysis algorithm developed in our research. Section 4 outlines the directions for future work.

2 Related Work

2.1 Cloned Code

Cloned code, also known as duplicated code, consists of code fragments that are similar to one another in syntax and semantics. Programmers’ copy–paste–modify practice is regarded as one of the main reasons for the majority of clones. Four types of cloned code have been identified so far:

  • Type-1 clones: Identical code fragments except for variations in white-space and comments.

  • Type-2 clones: Similar code snippets, where identifiers/variables can be renamed.

  • Type-3 clones: Syntactically similar code fragments in which one or more statements have been added, modified, or deleted.

  • Type-4 clones: Code fragments that perform the same calculation with different syntax.
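
As a hedged illustration of the four clone types, the sketch below shows small variants of one summation routine. The paper's subject systems are C programs; Python is used here only for brevity. All function names are hypothetical, and the function bodies are the fragments being compared (the names differ only so that the block runs as a single file).

```python
# Hypothetical fragments illustrating the clone types; not taken from the paper.

def total_original(values):
    total = 0
    for v in values:
        total += v
    return total


# Type-1 clone of total_original: the body differs only in whitespace and comments.
def total_type1(values):
    total = 0            # running sum
    for v in values:
        total += v

    return total


# Type-2 clone: same structure, identifiers renamed.
def total_type2(items):
    acc = 0
    for item in items:
        acc += item
    return acc


# Type-3 clone: a statement has been added (negative values are skipped).
def total_type3(items):
    acc = 0
    for item in items:
        if item < 0:
            continue
        acc += item
    return acc


# Type-4 clone: same calculation as total_original, different syntax.
def total_type4(values):
    return sum(values)
```

Only the Type-3 variant can change the result relative to the original, and only for inputs that contain negative values.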

Previous studies reported that software systems may contain 5–15 % duplicated code [18], and in some cases up to 50 % [19]. Based on the level of analysis applied to the source code, clone detection techniques can roughly be classified into four main categories: textual, lexical, syntactic, and semantic [5].

2.2 The Difficulties of Identifying Refactoring Opportunities

Code clone detection can be perceived as the identification of code fragments to be refactored [3]. However, not all clone groups are suitable for refactoring. Large-scale software systems usually have complicated, intertwined logic, which makes it difficult to identify which code clones can be merged and how best to merge them [3].

3 An Approach to Identifying Refactoring Opportunities with Metrics

Apart from computing resources, refactoring via function extraction incurs software maintenance costs because it introduces dependencies. Each dependency is a contract that the development team must maintain. On the other hand, refactoring via procedure extraction also provides a benefit in the form of a size reduction, i.e., fewer lines of code for the team to maintain. In this section, we derive a method to identify clone groups that are suitable for refactoring by analyzing the costs and benefits of refactoring via procedure extraction. This cost–benefit analysis assumes the same weight for a dependency and a line of code. These weights can be adjusted by software developers or managers depending on their particular context and needs.

3.1 Benefits

The benefit of Extract Method refactoring is the reduction in the length of cloned code. We assume that clone group F includes code fragments \( f_{1}, f_{2}, \ldots, f_{m} \). The benefit of extracting clone group F can then be represented as

$$ {\text{Benefit}}(F) = m \times (|cf| - 1) $$
(1)

where |cf| is the number of statements that can be extracted from each fragment of group F. In some cases, non-cloned code cannot be moved out of the cloned statements because of dependencies, so the extractable statements may include both cloned and non-cloned code. Moreover, procedure extraction leaves a procedure call in the original method. The actual reduction in length per fragment is therefore |cf| − 1.
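
Eq. (1) can be checked with a few lines of code. The sketch below is only an illustration: the function name `benefit` and the sample values are assumptions, not part of the original study.

```python
def benefit(m, cf):
    """Benefit(F) = m * (|cf| - 1), following Eq. (1).

    m  -- number of fragments in the clone group
    cf -- number of extractable statements per fragment, |cf|
    """
    if m < 1 or cf < 1:
        raise ValueError("m and cf must be positive")
    return m * (cf - 1)


# Assumed example: 3 fragments with 6 extractable statements each.
example_benefit = benefit(3, 6)
```

For this assumed group, each of the three fragments shrinks from six statements to one call, so the total reduction is 3 × 5 = 15 statements.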

3.2 Costs

Coupling is used to indicate the cost of procedure extraction. The basic strategy for merging code clones is to migrate the duplicated code to another place. To migrate implemented code, it is desirable that the code has low coupling with its surrounding code [3]. In this paper, we mainly focus on data coupling. Consequently, we calculate the coupling between the original method and the new method (the result of Extract Method refactoring) by counting how many parameters the new method needs. The formula is as follows:

$$ {\text{Coupling}}(F) = \sum\limits_{i = 1}^{m} {\left( {|P(i)_{\text{in}} | + |P(i)_{\text{out}} |} \right)} $$
(2)

where \( |P(i)_{\text{in}} | \) and \( |P(i)_{\text{out}} | \) are the numbers of input and output parameters of the new method if the i-th clone fragment is extracted from its enclosing method.

For each fragment, we denote the externally defined variables that it modifies as \( V_{w} \), and the externally defined variables that it accesses but does not modify as \( V_{r} \). The variables that appear before the fragment are denoted as \( V_{b} \), and the variables that appear after the fragment are denoted as \( V_{a} \). If the fragment is extracted as a new method and called in its original place, variables that appear before the fragment and are accessed by it (whether read or written) must be passed in as input parameters. Variables that are modified by the fragment and accessed by the following code must be returned as output parameters.

The formulas are shown as follows:

$$ P(i)_{\text{in}} = V_{b} \cap (V_{w} \cup V_{r} ) $$
(3)
$$ P(i)_{\text{out}} = V_{a} \cap V_{w} $$
(4)

In this paper, we require \( |P(i)_{\text{out}} | \) to be at most 1, because a function in the C programming language returns at most one value. If the value is greater than 1, the fragment is not suitable for extraction.
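
The parameter sets and the coupling metric can be sketched with ordinary set operations. The code below is a minimal illustration under stated assumptions: the helper names `parameters` and `coupling`, the tuple layout, and the sample variable names are hypothetical, and returning `None` for a non-extractable group is a design choice of this sketch rather than something the paper prescribes.

```python
# Hypothetical helper names; the set expressions follow Eqs. (3) and (4).

def parameters(v_before, v_after, v_read, v_written):
    """Return (P_in, P_out) for one fragment as sets of variable names."""
    p_in = v_before & (v_written | v_read)   # Eq. (3)
    p_out = v_after & v_written              # Eq. (4)
    return p_in, p_out


def coupling(fragments):
    """Sum |P_in| + |P_out| over all fragments of a clone group, Eq. (2).

    Each fragment is a tuple (V_b, V_a, V_r, V_w) of sets.
    Returns None if any fragment would need more than one output
    parameter, since such a fragment is treated as not extractable.
    """
    total = 0
    for v_before, v_after, v_read, v_written in fragments:
        p_in, p_out = parameters(v_before, v_after, v_read, v_written)
        if len(p_out) > 1:
            return None
        total += len(p_in) + len(p_out)
    return total


# Assumed example fragment: it reads buf and n, updates total, and
# total and n are used again after the fragment.
fragment = (
    {"buf", "n", "total"},   # V_b
    {"total", "n"},          # V_a
    {"buf", "n"},            # V_r
    {"total"},               # V_w
)
p_in, p_out = parameters(*fragment)
group_coupling = coupling([fragment, fragment])
```

In this assumed example each fragment needs three input parameters and one output parameter, so a group of two such fragments has a coupling of 8.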

3.3 Evaluation of the Benefit and Cost

The ratio of benefit/cost can be represented as

$$ R(F) = \begin{cases} \dfrac{{\text{Benefit}}(F)}{{\text{Coupling}}(F)} = \dfrac{m \times (|cf| - 1)}{\sum\nolimits_{i = 1}^{m} {(|P(i)_{\text{in}} | + |P(i)_{\text{out}} |)}}, & {\text{Coupling}}(F) > 0 \\ {\text{Benefit}}(F) = m \times (|cf| - 1), & {\text{Coupling}}(F) = 0 \end{cases} $$
(5)

If R(F) > 1, the clone group is considered suitable for refactoring; otherwise, it is not.
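
The decision rule can be sketched as follows. This is an illustration only: the function names `ratio` and `is_suitable`, the argument layout, and the sample numbers are assumptions, and equal weights for a dependency and a line of code are assumed as in the analysis above.

```python
def ratio(m, cf, param_counts):
    """R(F) following Eq. (5).

    m            -- number of fragments in the clone group
    cf           -- number of extractable statements per fragment, |cf|
    param_counts -- one (|P_in|, |P_out|) pair per fragment
    """
    benefit = m * (cf - 1)
    coupling = sum(n_in + n_out for n_in, n_out in param_counts)
    if coupling > 0:
        return benefit / coupling
    return benefit  # Coupling(F) = 0: R(F) is the benefit itself


def is_suitable(m, cf, param_counts):
    """A clone group is considered suitable for refactoring if R(F) > 1."""
    return ratio(m, cf, param_counts) > 1


# Assumed example: two fragments, each needing 3 inputs and 1 output.
counts = [(3, 1), (3, 1)]
long_group = ratio(2, 6, counts)    # benefit 10, coupling 8
short_group = ratio(2, 3, counts)   # benefit 4, coupling 8
```

With these assumed numbers, the group with six extractable statements per fragment passes the threshold (10/8 = 1.25), while the group with three statements per fragment does not (4/8 = 0.5).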

In addition, some clones consist only of declaration statements. Such clones are not feasible for refactoring because of the high coupling between the original method and the new method extracted from it. We evaluated all cloned code in the selected open-source programs. The results are shown in Table 1.

Table 1 The results of identifying clone groups that are feasible for refactoring

4 Future Work

Our results indicate that our approach accurately identifies clone groups that are feasible for refactoring. In future work, we hope that our study motivates IDEs such as Eclipse CDT and Microsoft Visual Studio to provide functionality for automatically analyzing cloned code. We will replicate this study on more systems. In particular, we will extend our cloned code analysis to prune more kinds of false positives.