1 Introduction

Refactoring is a key activity to preserve and improve the internal design of software systems. Due to the importance of the practice in modern software development, there is a large body of studies about refactoring, shedding light on aspects such as usage of refactoring engines (Murphy-Hill et al. 2009; Negara et al. 2013), documentation of refactorings using commit messages (Murphy-Hill et al. 2009), motivations for performing refactorings (Silva et al. 2016; Mazinanian et al. 2017; Tsantalis et al. 2013), benefits and challenges of refactoring (Kim et al. 2012; 2014), among others.

However, time seems to be an underinvestigated dimension in refactoring studies. The notable exceptions are studies on refactoring tactics, particularly on repeated refactoring operations, often called batch refactorings. For example, Murphy-Hill et al. (2009) define batch refactorings as operations that execute within 60 seconds of each other. They report that 40% of refactorings performed using a refactoring tool occur in batches, i.e., programmers repeat refactorings. But the authors also mention that “the main limitation of [our] analysis is that, while we wished to measure how often several related refactorings are performed in sequence, we instead used a 60-second heuristic”. Bibiano et al. (2019) investigate the characteristics and impact of batch refactorings on code elements affected by smells. The authors rely on a heuristic to retrieve batches (Cedrim 2018), which are groups of refactorings performed by the same author in a single code element. Thus, their heuristic focuses on single methods or classes, in most cases (93%) resulting in batches with a single commit. However, to our knowledge, refactorings performed over long time windows have not been studied in depth in the literature.

Therefore, in a previous conference paper (Brito et al. 2020), we proposed and evaluated a novel concept, called refactoring graphs, to study and reason about refactoring activities over time. In such graphs, the nodes are methods and the edges represent refactoring operations. For example, suppose that a method foo() is renamed to bar(). This operation is represented by two nodes, foo() and bar(), and one edge connecting them. After this first refactoring, suppose that a method qux() is extracted from bar(). As a result, an edge connecting bar() to a new node, representing qux(), is also added to the graph. Furthermore, refactoring graphs do not impose time constraints between the represented refactoring operations. In our example, the extract operation, for instance, can be performed months after the rename. Finally, refactoring graphs may also express refactorings performed by different developers. In our example, the rename can be performed by d1 and the extract operation by another developer d2.

Quantitative study

We formalize an algorithm to build refactoring graphs and use it to extract graphs for 20 well-known and popular open-source Java and JavaScript projects. In this first study, our goal is to characterize refactoring subgraphs. Thus, we answer seven research questions about the following properties:

  1. Refactorings over time: In both languages, approximately 30% of refactoring operations are part of a refactoring subgraph over time.

  2. Size: Most refactoring subgraphs are small. In Java, most cases refer to subgraphs with up to four vertices (85%) and three edges (83%). Similarly, in JavaScript, most refactoring subgraphs have up to four vertices (86%) and three edges (85%). However, we also found subgraphs due to large refactoring operations (e.g., subgraphs with more than 30 vertices).

  3. Commits: Most refactoring subgraphs are generated from two or three commits, e.g., 95% of Java subgraphs and 93% of JavaScript subgraphs.

  4. Age: The age of the refactoring subgraphs ranges from a few days to weeks or even months. For instance, in both languages, approximately 60% of the subgraphs are more than one month old.

  5. Homogeneity: 71% of Java subgraphs and 64% of JavaScript subgraphs include more than one refactoring type.

  6. Ownership: In both languages, about 60% of the refactoring subgraphs are created by a single developer.

  7. Patterns: The most recurring over time patterns of refactoring graphs have two edges. For example, in Java, the most recurrent case refers to successive rename operations, i.e., rename → rename (153 occurrences). In JavaScript, this is the second most recurrent pattern, appearing in 37 subgraphs in our dataset.

Qualitative study

In our quantitative study, we observed that most refactoring subgraphs are small. However, we also notice subgraphs describing large refactoring operations. Thus, in this second study, we selected and manually inspected 50 large refactoring subgraphs in terms of vertices. Then, we contacted the authors of these refactoring instances, asking for the motivations behind their operations. Based on these developers’ feedback, our results suggest that large subgraphs relate to two major reasons: improving code design and fixing bugs or improving existing features.

Paper extension

This paper is an extension of a previous study (Brito et al. 2020). We expand this former work in the following major points:

  1. We perform a novel analysis on JavaScript systems (the former paper only included Java) and extend all RQs with refactoring graphs mined for this language.

  2. We propose a new research question (RQ6) where we assess refactoring graph patterns, and an introductory research question (RQ0) about refactorings that are spread over multiple commits.

  3. We perform a novel qualitative analysis by applying a survey with developers, aiming to understand the motivations behind large refactoring subgraphs.

  4. We designed and implemented a web application to easily visualize refactoring graphs.Footnote 1 Also, we provide scripts to automatically visualize refactoring subgraphs for Java and JavaScript projects hosted on GitHub.Footnote 2

  5. We provide a new evaluation of the precision of RefDiff (Silva et al. 2021; Silva and Valente 2017), which is the tool we used to detect refactoring operations. The evaluation relies on real-world Java and JavaScript open-source projects, increasing the existing datasets with new refactoring instances.

Structure

Section 2 introduces RefDiff, which is the tool used to detect refactoring operations. Section 3 defines our concept of refactoring graphs. Section 4 describes the design of our quantitative study and results. Section 5 presents the second study, based on a survey with authors of large refactoring subgraphs. We discuss the key applications and implications in Section 6. Section 7 states threats to validity and Section 8 presents related work. Finally, we conclude the paper in Section 9.

2 RefDiff tool

RefDiff (Silva and Valente 2017; Silva et al. 2021) is a tool to detect refactoring operations. The current version is based on the Code Structure Tree (CST), which provides a language-agnostic representation of the source code. As a consequence, it is possible to detect refactorings in multiple languages. In our study, we concentrate on two programming languages supported by the tool: Java and JavaScript.Footnote 3 We selected these languages due to their popularity. For example, they are listed among the most adopted and loved programming languages by developers worldwide.Footnote 4 Besides that, most refactoring research in the literature discusses refactoring practices only in Java (Bibiano et al. 2019; Sousa et al. 2020; Silva et al. 2016; Pantiuchina et al. 2020). Therefore, by studying JavaScript, we aim to contribute a refactoring study that also considers an interpreted, dynamic, and very popular programming language. Also, we focus on method- or function-level operations since refactoring operations frequently affect these elements (Silva et al. 2021; Tsantalis et al. 2018; Hora et al. 2018). Table 1 lists the refactorings detected by RefDiff for these elements. As we can notice, both languages have well-known refactorings, comprising extract and inline operations, as well as changes in a method’s signature (i.e., rename and move).

Table 1 Function- and method-level refactorings detected by RefDiff

Since JavaScript is a dynamic language, inheritance-based refactorings are not detected in this language (i.e., pull up and push down). In addition, JavaScript-based systems usually contain large files that are composed of several nested elements. For this reason, many refactorings occur within a single file. RefDiff reports these cases as internal operations. Listing 2 shows an example of an internal move operation. In this case, the developer moved function f1 from fa to fb. However, both functions are located in a single file.

Listing 2 Example of an internal move operation

3 Refactoring graphs

A refactoring graph G is a set of disconnected subgraphs \(G' = (V^{\prime }, E^{\prime })\). Each \(G^{\prime }\) is called a refactoring subgraph, with a set of vertices \(V^{\prime }\) and a set of directed edges \(E^{\prime }\). In this way, the history of a software system includes a set of refactoring subgraphs. In refactoring (sub)graphs, the vertices are the full signatures of methods or functions. For instance, in Java projects, we label a method m() in class Foo and package util as util.Foo#m(). Since Java is a strongly typed programming language, the signature also includes the types of the parameters. For example, we label the same method m as util.Foo#m(String) when it takes a parameter of type String. In JavaScript graphs, this procedure is not practicable since the language is dynamically typed. Thus, we label the vertices using the file name. For example, util.Bar.js.C#f1 represents a function f1 in class C, file Bar.js, and directory util. Finally, the edges indicate the refactoring type (e.g., move method) and they also include meta-data about the operation (e.g., author name and date).
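The labeling scheme above can be illustrated with a minimal sketch. The helper names `java_label` and `js_label` are hypothetical, and two details are assumptions that the text does not specify: the separator used between multiple parameter types and the label format for a JavaScript function that is not nested in a class.

```python
def java_label(package, cls, method, param_types=()):
    """Build a vertex label such as util.Foo#m(String) for a Java method.

    Assumption: multiple parameter types are joined with ", ".
    """
    return f"{package}.{cls}#{method}({', '.join(param_types)})"


def js_label(directory, file_name, function, cls=None):
    """Build a vertex label such as util.Bar.js.C#f1 for a JavaScript function.

    Assumption: when there is no enclosing class, the file name alone
    precedes the '#' separator.
    """
    container = f"{directory}.{file_name}"
    if cls is not None:
        container = f"{container}.{cls}"
    return f"{container}#{function}"


print(java_label("util", "Foo", "m"))
print(java_label("util", "Foo", "m", ["String"]))
print(js_label("util", "Bar.js", "f1", cls="C"))
```

Because the label is the vertex identity, two overloads of the same Java method become two distinct vertices, whereas JavaScript functions are distinguished only by their location and name.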

Figure 1 shows an example of a refactoring graph. A developer extracted three methods from m1(), which are named x(), y(), and z(). The edges refer to the refactoring operation. It is worth noting that a refactoring graph can include refactorings performed by multiple developers. For instance, Fig. 2 illustrates a second example, where a developer D1 extracted two methods from m2(), which are named a() and b(). Then, a second developer D2 renamed b() to c(). After that, a reviewer might have suggested keeping the original name. Thus, the developer undid the latest refactoring, renaming c() to b() again. In this case, the graph contains refactorings performed by two authors. Besides, a cycle is created when the developer reverts the method to its original name.

Fig. 1 Refactoring subgraph produced by only one developer

Fig. 2 Refactoring subgraph over time

As presented in Fig. 3, in the case of Java, we center our study on eight distinct refactorings at the method level. Rename and move are the most trivial operations since they involve just changing the method’s signature. Extract operations generate new methods in the same class (i.e., they create a new node in our subgraphs). It is also possible to extract a method m() or multiple methods \(m_i\) from a single method m1(). Furthermore, as illustrated in Fig. 3, it is possible to extract m() from multiple methods \(m_i\). In this case, the extracted code is duplicated in each method \(m_i\). Inline method is the dual operation, involving the removal of trivial elements and the replacement of the respective calls by their content. As in the case of extract, we can inline a method m() in multiple methods \(m_i\). We also studied a refactoring called extract and move that extracts a method to another class. Finally, inheritance-based refactorings comprise the movement of one or more methods to supertypes or subtypes (i.e., pull up and push down). For example, a pull up moves methods from subclasses to a superclass.

Fig. 3 Example of refactoring subgraphs (Java)

Similar refactorings apply to functions in JavaScript. As shown in Fig. 4, in JavaScript, there are also internal operations, i.e., refactorings performed in a single file.

Fig. 4 Example of refactoring subgraphs (JavaScript)

4 Quantitative study: Characterizing refactoring graphs

4.1 Study design

In this study, our goal is to quantitatively analyze refactoring in multiple programming languages with the purpose of understanding and characterizing refactoring activities performed over time. The context of the study consists of approximately 1.5K refactoring subgraphs from 20 Java and JavaScript open-source projects. Since refactoring graphs are a novel abstraction, we see value in starting by shedding light on several of their properties. In other words, before performing a qualitative study with developers (Section 5), we found it important to mine the maximum amount of data and information about such graphs. Specifically, we address the following research questions, aiming to investigate seven properties: refactorings over time, size, number of commits, age, homogeneity, ownership, and patterns.

  • (RQ0) How many refactoring operations generate subgraphs over time? Most studies concentrate on refactorings performed in a single commit (Silva et al. 2016; Jiang et al. 2021; Di Penta et al. 2020; AlOmar et al. 2021). For this reason, the rationale of this preliminary research question is to assess the prevalence of the key practice we investigate in our study, i.e., refactorings that are spread over multiple commits.

  • (RQ1) What is the size of refactoring subgraphs? We are interested in investigating the size of refactoring subgraphs, in terms of number of vertices and edges. This investigation may provide insights about the impact of refactorings in the design/architecture of the studied systems.

  • (RQ2) How many commits are represented in refactoring subgraphs? Each commit can contribute one or more refactorings to a refactoring subgraph. Therefore, our objective is to investigate how refactoring subgraphs grow over time. This investigation complements the perspective of previous studies, which rely on refactoring operations detected in a single commit or in a short time interval (Sousa et al. 2020; Murphy-Hill et al. 2009).

  • (RQ3) What is the age of refactoring subgraphs? We investigate the lifetime of subgraphs, i.e., the interval between the first and the latest refactoring operation in a subgraph. For example, this investigation might also provide insights about large and long-running changes in the design/architecture of the studied systems.

  • (RQ4) Which are the most common refactoring operations in refactoring subgraphs? In this RQ, we discuss the most recurring refactoring types that occur over time, complementing the panorama of studies that report the frequency of single-commit operations (Tsantalis et al. 2020; Silva et al. 2021; Silva et al. 2016; Pantiuchina et al. 2020). We also analyze the homogeneity of refactoring subgraphs. In other words, we investigate the frequency of subgraphs formed by the same or distinct refactoring types.

  • (RQ5) Are refactoring subgraphs created by the same or by multiple developers? The rationale of this research question is to investigate whether refactoring operations over time are performed by distinct developers. That is, we aim to assess whether refactoring operations over time are concentrated on single developers or spread over multiple ones.

  • (RQ6) What are the most common refactoring subgraphs? This research question provides an overview of recurrent graphs in distinct projects, i.e., refactoring graph patterns that occur frequently in our dataset.

4.1.1 Selecting projects

In this paper, we analyze the characteristics and frequency of refactoring subgraphs in popular Java and JavaScript systems. We used the following criteria for selecting the projects for each programming language. First, the projects should be among the top-100 GitHub repositories in terms of stars, since stars is a key metric to reveal the popularity of repositories (Borges et al. 2016; Borges and Valente 2018). Second, the project should have more than 1K commits (in order to remove recent systems with a short history of refactoring activity). Finally, the project should be a software system. Thus, we removed, for example, code samples (such as iluwatar/java-design-patterns)Footnote 5 and JavaScript style guides (such as airbnb/javascript).Footnote 6 Table 2 describes the selected projects, including basic information, such as number of stars, commits, files, contributors, and short description. These projects cover distinct domains, including web development systems and media processing libraries, for example.

Table 2 Selected projects (Java and JavaScript)

4.1.2 Detecting refactoring operations

As mentioned in Section 2, we use RefDiff (Silva and Valente 2017; Silva et al. 2021) to detect the refactoring operations represented in refactoring graphs. RefDiff identifies refactorings between two versions of a git-based project. In our study, we focus on well-known refactoring operations detected by RefDiff at the method or function level, as presented in Figs. 3 and 4. RefDiff works by comparing each commit with its previous version in history. To avoid analyzing commits from temporary branches, we focus on the main branch evolution. Particularly, we use the command git log --first-parent to get the list of commits of each project.Footnote 7 Additionally, we remove refactorings in packages that are not part of the core system. For Java projects, we remove refactorings from packages with the keywords test(s), example(s), and sample(s). In JavaScript, we also filter other keywords. For instance, we discarded refactorings from the package dist, since it is frequently used to store source code for distribution. Other cases are specific to a single JavaScript system. For example, in Vue, we remove refactorings from packages/vue-server-renderer since the documentation mentions: “This package is auto-generated”.Footnote 8
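The package filter described above can be sketched as follows. This is only an illustration: the function name `is_core` is hypothetical, and matching the keywords against whole dot-separated segments of the qualified name (rather than substrings) is an assumption, since the text does not state how the keywords are matched.

```python
import re

# Java keywords named in the text: test(s), example(s), sample(s).
# Assumption: a keyword must match a whole dot-separated segment.
EXCLUDED_SEGMENT = re.compile(r"^(tests?|examples?|samples?)$")


def is_core(label, excluded=EXCLUDED_SEGMENT):
    """Return True if the labeled method does not live in an excluded package."""
    # Drop the method part first, so dots in parameter types are ignored.
    container = label.split("#", 1)[0]
    return not any(excluded.match(segment) for segment in container.split("."))


labels = [
    "util.Foo#m()",
    "org.app.test.FooTest#m()",
    "org.app.examples.Demo#run()",
]
print([label for label in labels if is_core(label)])
```

Under this assumption a package such as `contest` is kept, whereas a substring match would discard it; which behavior is closer to the study's filter is not determined by the text.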

4.1.3 Building refactoring graphs

As mentioned earlier, we identify refactoring subgraphs over time in 20 systems. Algorithm 1 presents the steps to build refactoring graphs. The input comprises a list of refactorings, e.g., util.Foo#m() moved to util.Bar#m(). First, the algorithm identifies each refactoring t and the two methods involved, m1 and m2 (line 3). Then, it creates a directed edge representing this refactoring (line 5). Since V and E are sets, each element is represented only once. The edges are labeled with the refactoring’s name t. The output is the set of refactoring subgraphs.

Algorithm 1 Steps to build refactoring graphs
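The construction described above can be approximated with the following sketch. It is not the original Algorithm 1: the function names and the tuple layout `(type, source, target)` are hypothetical, and grouping the graph into weakly connected components is an assumption about how the disconnected subgraphs are obtained.

```python
from collections import defaultdict


def build_refactoring_graph(refactorings):
    """Build vertex and edge sets from (type, source, target) tuples."""
    vertices = set()
    edges = set()
    for ref_type, source, target in refactorings:
        vertices.add(source)
        vertices.add(target)
        # Edges are directed and labeled with the refactoring type.
        edges.add((source, target, ref_type))
    return vertices, edges


def split_into_subgraphs(vertices, edges):
    """Group vertices and edges into weakly connected components."""
    neighbors = defaultdict(set)
    for source, target, _ in edges:
        neighbors[source].add(target)
        neighbors[target].add(source)
    seen = set()
    subgraphs = []
    for start in sorted(vertices):
        if start in seen:
            continue
        component = set()
        stack = [start]
        while stack:
            vertex = stack.pop()
            if vertex in component:
                continue
            component.add(vertex)
            stack.extend(neighbors[vertex] - component)
        seen |= component
        component_edges = {edge for edge in edges if edge[0] in component}
        subgraphs.append((component, component_edges))
    return subgraphs


refactorings = [
    ("rename", "foo()", "bar()"),
    ("extract", "bar()", "qux()"),
    ("move", "util.Foo#m()", "util.Bar#m()"),
]
vertices, edges = build_refactoring_graph(refactorings)
for component, component_edges in split_into_subgraphs(vertices, edges):
    print(sorted(component), sorted(component_edges))
```

Storing edges in a set mirrors the remark that each element is represented only once. A fuller version would attach the commit, author, and date to each edge, which the later analyses rely on.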

Table 3 presents the frequency of refactoring subgraphs for each Java project, and Table 4 presents the results for JavaScript. Considering both languages, we detect a total of 11,341 refactoring subgraphs: 9,200 for Java and 2,141 for JavaScript.

Table 3 Frequency of refactoring subgraphs (Java)
Table 4 Frequency of refactoring subgraphs (JavaScript)

Spring Framework has the highest number of subgraphs (3,206), while Square Retrofit has the lowest amount (182). Overall, 87% of the refactoring subgraphs comprise operations performed in a single commit. This ratio varies from 67.1% (Glide) to 93% (Apache Dubbo). The results follow a similar trend in JavaScript systems. The percentage of single-commit subgraphs ranges from 67.9% (Carbon) to 93.1% (Nylas Mail).

From RQ1 to RQ5, we assess 1,525 subgraphs with number of commits ≥ 2, because they are the ones that represent refactorings over time.
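The selection criterion above (number of commits ≥ 2) can be sketched as follows, assuming that each edge is a dictionary carrying a hypothetical `commit` key with the identifier of the commit in which the refactoring was detected.

```python
def distinct_commits(edges):
    """Return the set of commit ids attached to the edges of one subgraph."""
    return {edge["commit"] for edge in edges}


def subgraphs_over_time(subgraphs, min_commits=2):
    """Keep only the subgraphs whose edges span at least min_commits commits."""
    return [
        edges for edges in subgraphs
        if len(distinct_commits(edges)) >= min_commits
    ]


single_commit = [{"commit": "c1"}, {"commit": "c1"}]
two_commits = [{"commit": "c1"}, {"commit": "c2"}]
print(len(subgraphs_over_time([single_commit, two_commits])))
```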

4.1.4 Mining frequent graphs

In our last research question (RQ6), we investigate frequent graphs, i.e., graph patterns that occur frequently in our dataset. For this analysis, we use GSpan, a well-known algorithm that identifies subgraphs whose incidence is greater than a given support (Yan and Han 2002; Leung 2010). Figure 5 shows a simple example of a graph pattern. For instance, suppose that GSpan reports the operation move method followed by a rename method as a pattern that occurs repeatedly in our dataset. As we can notice, G1 contains this pattern (grey vertices), which refers to two distinct commits over time.

Fig. 5 Example of over time graph patterns

G2 illustrates a second example, a pattern with three extract method operations, as shown in Fig. 6. However, in this case, there are two possible situations: (i) the three extract operations were performed in a single commit, or (ii) the extract operations were performed in multiple commits over time.

Fig. 6 Example of possibly atomic graph patterns

Therefore, there are two categories of refactoring patterns: possibly atomic refactoring patterns and over time refactoring patterns. Over time patterns represent frequent refactorings performed in distinct commits (e.g., G1). In contrast, possibly atomic patterns can be detected in single or multiple commits. In other words, we cannot safely infer they include refactorings over time (e.g., G2).Footnote 9 Besides that, GSpan can report more than one pattern in the same subgraph. For instance, the algorithm can identify a pattern with two extract operations and a second pattern with three extract operations in G2.

Finally, it is also worth noting that refactoring graphs might have cycles, as in the example of Fig. 7. In this subgraph, the extract refactorings were performed in the same commit. After that, in a second commit, one of the extract operations was reverted using an inline operation. If we do not take precautions, GSpan might detect the following pattern in this graph: inline → extract (assuming this pattern also happens in other subgraphs). However, this is a misleading pattern, since it places the inline before the extract, whereas the extract was actually performed first. Misleading patterns are only possible when at least one edge is part of a cycle. For this reason, in order to answer RQ6, we implemented a script to identify and remove subgraphs with cycles from our dataset. As a result, we discarded 289 subgraphs in Java (3%) and 47 subgraphs in JavaScript (2%).
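A check for subgraphs with cycles can be sketched with a depth-first search. This is not the script used in the study; the function name `has_cycle` and the representation of edges as `(source, target)` pairs are assumptions made for illustration.

```python
def has_cycle(edges):
    """Return True if the directed graph given as (source, target) pairs has a cycle."""
    successors = {}
    for source, target in edges:
        successors.setdefault(source, set()).add(target)
        successors.setdefault(target, set())

    WHITE, GREY, BLACK = 0, 1, 2  # unvisited, on the current path, finished
    color = {vertex: WHITE for vertex in successors}

    def visit(vertex):
        color[vertex] = GREY
        for nxt in successors[vertex]:
            if color[nxt] == GREY:
                return True  # back edge: nxt is on the current path
            if color[nxt] == WHITE and visit(nxt):
                return True
        color[vertex] = BLACK
        return False

    return any(color[vertex] == WHITE and visit(vertex) for vertex in successors)


# An extract later reverted by an inline produces a cycle.
reverted = [("m()", "x()"), ("m()", "y()"), ("x()", "m()")]
acyclic = [("m()", "x()"), ("m()", "y()")]
print(has_cycle(reverted), has_cycle(acyclic))
```

The recursive search is adequate for the subgraph sizes reported here (at most a few dozen vertices); very deep graphs would call for an iterative variant.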

Fig. 7 Example of a misleading refactoring graph pattern (inline → extract)

Since patterns can be detected in any number of commits (i.e., even in a single commit), in RQ6, we do not separate the dataset by the number of commits. As a result, in this RQ, we assess 8,911 subgraphs in Java,Footnote 10 and 2,094 subgraphs in JavaScript.Footnote 11 We fixed support = 13 (Java) and support = 8 (JavaScript). We set these thresholds after experiments where we strived to balance two variables: execution time and a reasonable number of occurrences that would allow us to classify the retrieved graphs as patterns. The threshold for JavaScript is lower because the number of graphs we mined for this language is also lower.

4.1.5 Overview of data collection and analysis

Table 5 presents an overview of the dataset we use to address the research questions. RQ0 provides an introductory analysis, considering the frequency of multiple-commits operations in the subgraphs over time. From RQ1 to RQ5, we work on the same sample, which includes 1,525 refactoring subgraphs over time (1,198 Java and 327 JavaScript). In the case of RQ6, we consider all subgraphs without cycles to investigate refactoring graph patterns.

Table 5 Numbers of the quantitative study

4.2 Results

4.2.1 (RQ0) How many refactoring operations generate subgraphs over time?

In this first research question, we provide an overview of the refactoring operations in our sample. Specifically, we discuss how many refactorings result in subgraphs over time. As presented in Table 6, for Java, 29.3% of the operations are part of a refactoring subgraph over time (3,853 occurrences).

Table 6 Frequency of refactoring operations in subgraphs (Java)

In the case of JavaScript, this rate is 32.4% (902 occurrences), as shown in Table 7. Interestingly, in three projects, more than 50% of the detected refactorings correspond to edges of subgraphs over time: Carbon (53.7%), Request (60.2%), and Glide (56%).

Table 7 Frequency of refactoring operations in subgraphs (JavaScript)

4.2.2 (RQ1) What is the size of refactoring subgraphs?

As presented in Fig. 8, in Java, most refactoring subgraphs have three vertices (630 occurrences, 53%). The other recurrent cases comprise subgraphs with two (19%) or four vertices (13%). Square Okhttp holds the largest subgraph regarding the number of vertices (57), which are mostly related to inline operations. Concerning the number of edges, most subgraphs have two (67%) or three edges (16%). MPAndroidChart has the largest subgraph in terms of edges. It has 61 edges, most representing extract and move operations. Therefore, most subgraphs contain few methods (vertices) and refactoring operations (edges).

Fig. 8 Size of refactoring subgraphs (Java)

Figure 9 shows a real example of a refactoring subgraph from MPAndroidChart, which includes three distinct refactoring operations. In the first commit C1, a developer renamed method drawYLegend() to drawYLabels().Footnote 12 In the subsequent commit C2, performed 13 days later, the same developer extracted a new method from drawYLabels().Footnote 13 Two days after the second operation, in commit C3, the developer made new extractions from drawYLabels() to another class, creating a subgraph with five vertices and four edges.Footnote 14

Fig. 9 Example of a refactoring subgraph from MPAndroidChart (Java)

In the case of JavaScript, most subgraphs also have three vertices (57%), as shown in Fig. 10. Other common cases refer to subgraphs with two (11%) or four vertices (18%). Regarding the number of edges, the subgraphs are also small: 92% of them involve up to four edges.

Fig. 10 Size of refactoring subgraphs (JavaScript)

Figure 11 presents an example of a refactoring subgraph from Quill, which includes five edges and three distinct refactoring operations. In commit C1, a developer renamed function formatCursor to format.Footnote 15 Seven months later, in commit C2, the same developer made four extract operations to function isEnable, aiming to remove a single duplicated line.Footnote 16

Fig. 11 Example of a refactoring subgraph from Quill (JavaScript)

4.2.3 (RQ2) How many commits are represented in refactoring subgraphs?

In this second question, we investigate the number of commits per subgraph. As presented in Fig. 12, most cases include subgraphs with two or three commits. In Java, 95% of subgraphs (1,135 occurrences) are created from up to three commits. The largest subgraph in terms of commits is again from Square Okhttp (18 commits). Similarly, in JavaScript, 93% of subgraphs (304 occurrences) also comprise two or three commits.

Fig. 12 Number of commits by refactoring subgraph

Figure 13 shows an example from Elasticsearch. In commit C1, a developer moved two methods from class SocketSelector to NioSelector.Footnote 17 After approximately three months, in commit C2, a second developer extracted duplicated code from three methods to a new method named handleTask(Runnable).Footnote 18 Among the source methods, two are the ones moved earlier. As a consequence, these two commits create a refactoring subgraph with six vertices and five edges.

Fig. 13 Example of a refactoring subgraph from Elasticsearch (Java)

4.2.4 (RQ3) What is the age of refactoring subgraphs?

To assess age, we compute the number of days between the most recent and the oldest commit in a subgraph. Figure 14 presents the results for Java. Considering the median of the distributions, the youngest subgraphs are found in Lottie Android and RxJava, with 3 and 3.4 days, respectively. On the other hand, the oldest subgraphs are found in Glide (489.8 days), Fresco (167.8), and Spring Framework (121.9). The other systems have subgraphs with a median age between 45.4 (Elasticsearch) and 84 days (MPAndroidChart). Regarding the maturity of the target systems, the youngest project is Lottie Android (3 years) while the oldest one is Elasticsearch (9 years). We run Spearman’s test to assess the correlation between the age of the systems and the median age of their refactoring subgraphs. The correlation coefficient (rho) is 0.115, suggesting a negligible correlation (Borges and Valente 2018; Hinkle et al. 2003). In other words, there are subgraphs with different ages in both old and young systems.
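The two computations above can be sketched as follows: the age of a subgraph from its commit dates, and Spearman's rank correlation between two samples. The function names are hypothetical, the sample values are invented for illustration and are not taken from the dataset, and the rank correlation is implemented by hand (Pearson correlation over average ranks) rather than with a statistics library.

```python
from datetime import datetime
from statistics import mean


def age_in_days(commit_dates):
    """Days between the oldest and the most recent commit of a subgraph."""
    return (max(commit_dates) - min(commit_dates)).total_seconds() / 86400


def average_ranks(values):
    """Rank values from 1, assigning tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman_rho(xs, ys):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5  # undefined if either sample is constant


# Hypothetical inputs: system ages in years and median subgraph ages in days.
system_age_years = [3, 5, 7, 9]
median_subgraph_age_days = [40.0, 10.0, 90.0, 60.0]
print(age_in_days([datetime(2016, 8, 1), datetime(2016, 8, 7)]))
print(spearman_rho(system_age_years, median_subgraph_age_days))
```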

Fig. 14 Age of the refactoring subgraphs (Java)

Figure 15 presents the distribution of age in JavaScript. Considering the median, the youngest subgraphs are from Carbon and Nylas Mail, with approximately 25 days. In contrast, there are also older subgraphs. For instance, in Hexo, the median is around four years. Thus, the age of refactoring subgraphs is also diverse in JavaScript. Spearman’s test suggests a moderate correlation in our sample (rho = 0.624). In other words, the older the system, the higher the median age of its refactoring subgraphs tends to be.

Fig. 15 Age of the refactoring subgraphs (JavaScript)

Figure 16 shows an example of a subgraph describing refactorings performed within a few days in Spring Framework. In commit C1, a developer renamed method before(...) to filterBefore(...).Footnote 19 After six days, the same developer reverted the operation in commit C2, renaming filterBefore(...) to the original name.Footnote 20 Figure 17 presents a second example, a subgraph spanning more than one year in Vue. The first operation occurs in August 2016, in commit C1, when a developer extracts a function from createElm.Footnote 21 The same developer performs three more operations during the following 15 months, extracting functions createChildren,Footnote 22 createComponent,Footnote 23 and isUnknownElement.Footnote 24

Fig. 16 Example of a refactoring subgraph from Spring Framework (Java)

Fig. 17 Example of a refactoring subgraph from Vue (JavaScript)

4.2.5 (RQ4) Which are the most common refactoring operations in refactoring subgraphs?

Table 8 presents the most common refactoring operations in Java. Most cases include rename method (20%), move method (18%), and extract and move method (17%). By contrast, we detected only 92 occurrences of move and rename operations. There are also few inheritance-based refactorings, i.e., pull up (369 occurrences) and push down (148 occurrences).

Table 8 Frequency of refactoring operations (Java)

We also divided our sample of 1,198 subgraphs into two groups. The homogeneous group includes subgraphs with a single refactoring operation. In contrast, the heterogeneous group comprises subgraphs with at least two distinct refactoring operations. As presented in Table 9, around 28.9% of the subgraphs are homogeneous, while 71.1% are heterogeneous. The results per system follow a similar tendency. Most of the projects have more heterogeneous subgraphs than homogeneous ones; the sole exception is RxJava (52.3% vs 47.7%). In addition, as presented in Fig. 18, heterogeneous subgraphs often include two distinct refactoring types (60%); in contrast, 8% have three and only 3% have four or more distinct refactoring types.
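The homogeneous versus heterogeneous split can be sketched as follows, assuming each subgraph is given as a collection of `(source, target, type)` edge tuples. The function names and the sample subgraphs are hypothetical and only illustrate the classification rule stated above.

```python
from collections import Counter


def distinct_types(edges):
    """Return the set of refactoring types that label the edges of a subgraph."""
    return {ref_type for _, _, ref_type in edges}


def is_homogeneous(edges):
    """A subgraph is homogeneous when all of its edges share one refactoring type."""
    return len(distinct_types(edges)) == 1


def homogeneity_shares(subgraphs):
    """Percentage of homogeneous and heterogeneous subgraphs in a sample."""
    counts = Counter(
        "homogeneous" if is_homogeneous(edges) else "heterogeneous"
        for edges in subgraphs
    )
    total = sum(counts.values())
    return {
        group: 100 * counts[group] / total
        for group in ("homogeneous", "heterogeneous")
    }


sample = [
    [("m1()", "x()", "extract"), ("m1()", "y()", "extract")],
    [("foo()", "bar()", "rename"), ("bar()", "qux()", "extract")],
]
print(homogeneity_shares(sample))
```

Counting `len(distinct_types(edges))` per subgraph also yields the distribution of the number of distinct refactoring types per subgraph.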

Table 9 Homogeneous vs heterogeneous refactoring subgraphs (Java)
Fig. 18 Number of distinct refactorings by subgraph (Java)

As shown in Table 10, in JavaScript, 76% of the refactorings refer to extract, move, and rename operations. There are also 88 occurrences of internal move operations, that is, the movement of nested functions within a single file. Among the 902 refactorings, 628 cases (69.6%) belong to heterogeneous subgraphs, which is the largest group, as presented in Table 11. Besides that, as shown in Fig. 19, heterogeneous subgraphs frequently include two distinct refactoring operations, following the same tendency as Java subgraphs.

Table 10 Frequency of refactoring operations (JavaScript)
Table 11 Homogeneous vs heterogeneous refactoring subgraphs (JavaScript)
Fig. 19

Number of distinct refactorings by subgraph (JavaScript)

Figure 20 shows an example of a homogeneous subgraph from Facebook Fresco. In this case, the subgraph represents four extract operations performed over time. First, in commit C1, a developer extracted fetchDecodedImage(...) from two methods into class ImagePipeline.Footnote 25 The next operations happened years later when a second developer made two new extract operations in commits C2Footnote 26 and C3.Footnote 27

Fig. 20

Example of a homogeneous refactoring subgraph from Facebook Fresco (Java)

As a second example, we present a heterogeneous subgraph from Parcel in Fig. 21. In this case, a single developer performed three distinct operations in nine months by renaming function resolveModule to resolveAsset,Footnote 28 moving it to another file,Footnote 29 and extracting function getLoadedAsset.Footnote 30

Fig. 21

Example of a heterogeneous refactoring subgraph from Parcel (JavaScript)


4.2.6 (RQ5) Are refactoring subgraphs created by the same or by multiple developers?

In the fifth question, we separate the refactoring subgraphs into two groups. The first group includes subgraphs with refactoring operations performed by a single developer. The second category is the opposite; it holds subgraphs by multiple developers. As presented in Table 12, in Java, most subgraphs have a single author (61.4%). It is also possible to notice a similar tendency in JavaScript, i.e., 203 subgraphs (62.1%) include refactoring operations performed by a sole developer, as shown in Table 13.
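The split between the two groups can be illustrated as follows. This is a minimal sketch under stated assumptions: each refactoring edge is assumed to carry the e-mail of the author of its commit, and the function name and the e-mail addresses are hypothetical.

```python
def developer_group(edge_authors):
    """Return 'single' or 'multiple' for one refactoring subgraph.

    `edge_authors` is a hypothetical iterable with one author e-mail per
    refactoring edge. E-mails are compared as-is, so the same person using
    two addresses is counted as two developers.
    """
    distinct = set(edge_authors)
    if not distinct:
        raise ValueError("a refactoring subgraph needs at least one edge")
    return "single" if len(distinct) == 1 else "multiple"


# Toy subgraphs with made-up addresses.
by_one = ["d1@example.org", "d1@example.org", "d1@example.org"]
by_two = ["d1@example.org", "d2@example.org", "d2@example.org"]

print(developer_group(by_one))
print(developer_group(by_two))
```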

Table 12 Developers by refactoring graphs (Java)
Table 13 Developers by refactoring graphs (JavaScript)

Figure 22 presents an example of a refactoring subgraph from Square Okhttp. First, in commit C1, developer D1 renamed three methods from class OkHttpClient.Footnote 31 Basically, the developer removed the prefix set from their names. After 10 months, a second developer D2 removed duplicated code from these methods, extracting method checkDuration(...).Footnote 32 Then, after seven months, D2 moved this method to a new class named Util, in commit C3.Footnote 33 As a result, these two developers are responsible for a refactoring subgraph with eight vertices and seven edges. Figure 23 shows an opposite scenario, a subgraph from Facebook React, which was created by a single developer. After performing five inline operations,Footnote 34 the developer renamed a function, adding the prefix deprecated.Footnote 35

Fig. 22

Example of a refactoring subgraph created by multiple developers in Square Okhttp (Java)

Fig. 23

Example of a refactoring subgraph created by a single developer in Facebook React (JavaScript)


4.2.7 (RQ6) What are the most common refactoring subgraphs?

In this last research question, we mine frequent refactoring patterns, that is, subgraph structures that occur frequently in our dataset.

As presented in Table 14, in Java, we detect a total of 38 patterns using GSpan (Yan and Han 2002). Most cases refer to over time patterns (60.5%, 23 occurrences), i.e., patterns that happen over multiple commits. In contrast, 15 patterns (39.5%) refer to possibly atomic patterns, that is, they can happen in single or multiple commits.
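To make the notion of support concrete, the sketch below counts how many subgraphs contain one fixed pattern: a chain of two consecutive refactorings, which cannot be collapsed into one vertex and therefore has three vertices. It is not GSpan, which enumerates all frequent patterns; it only checks a single pattern supplied by hand. The edge representation, the function names, and the assumption that support is the number of subgraphs containing the pattern at least once are illustrative.

```python
def contains_chain(edges, first, second):
    """Return True if an edge labelled `first` is directly followed by one labelled `second`.

    `edges` is a hypothetical list of (source, target, refactoring_type) tuples.
    """
    # Vertices reached by an edge of the first refactoring type.
    reached = {dst for _src, dst, kind in edges if kind == first}
    # The chain exists if an edge of the second type leaves one of those vertices.
    return any(src in reached for src, _dst, kind in edges if kind == second)


def support(subgraphs, first, second):
    """Count the subgraphs that contain the chain at least once."""
    return sum(1 for edges in subgraphs if contains_chain(edges, first, second))


# Toy dataset with hypothetical method names.
dataset = [
    [("foo()", "bar()", "rename"), ("bar()", "baz()", "rename")],
    [("foo()", "bar()", "rename"), ("bar()", "qux()", "extract")],
    [("a()", "b()", "move")],
]

print(support(dataset, "rename", "rename"))
print(support(dataset, "rename", "extract"))
```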

Table 14 Refactoring patterns

Figure 24 shows the distribution of the 38 patterns by the number of distinct projects and their support in Java. Interestingly, four patterns appear in all studied systems. Furthermore, 75% of the patterns occur in up to eight projects, and support values range from 14 to 153.

Fig. 24

Patterns distribution (Java)

In JavaScript, GSpan reports 15 patterns, 11 of them in the over time category (73%). Figure 25 presents the distribution of the detected patterns. The support median is 18, varying from 8 to 50.

Fig. 25

Patterns distribution (JavaScript)

In the remainder of the section, we provide an analysis of refactoring patterns considering their number of vertices. As shown in Table 14, this number ranges from three to five vertices.

Refactoring graph patterns with three vertices

As we can observe in Table 14, in Java, all over time patterns have three vertices. Figure 26 shows the top-5 over time patterns in terms of support. Interestingly, the most recurrent patterns are homogeneous, that is, they refer to successive rename operations (P1, 153 occurrences) and move operations (P2, 65 occurrences). In fact, P1 appears in all studied Java systems.

Fig. 26

Top-5 over time graph patterns

Figure 27 presents a subgraph from Glide with pattern P1. A single developer performed the operations that represent the over time pattern in commits C1 and C2. First, he renamed buildStreamOpener to buildStreamLoader.Footnote 36 The developer repeated the same operation ten days later, replacing the prefix build with get in the method’s name.Footnote 37

Fig. 27

Example of a refactoring graph pattern from Glide (Java, 153 occurrences)

In the case of JavaScript, support values are lower due to the sample size. However, the results show a similar tendency. All over time patterns have three vertices, as shown in Table 14. Besides, as presented in Fig. 26, the top-2 patterns are homogeneous.

Refactoring graph patterns with four vertices

In both languages, all patterns with four vertices belong to the possibly atomic group. Figure 28 presents an example from Spring Framework. This graph describes multiple extract operations from method processConstraintViolations(...) to three methods.Footnote 38 This pattern occurs in 19 subgraphs in our dataset.

Fig. 28

Example of a possibly atomic graph pattern from Spring Framework (Java, 19 occurrences)

Refactoring graph patterns with five vertices

In Java, the sole graph pattern occurs in 16 subgraphs and it includes four inline operations. Figure 29 shows a refactoring subgraph from RxJava with this pattern (P7). In this subgraph, the inline operations involve the removal of method threadPoolForComputation, and replacement of the respective calls in six methods.Footnote 39 There are no occurrences of patterns with five vertices in JavaScript.

Fig. 29

Example of a refactoring graph pattern from RxJava (Java, 16 occurrences)

5 Qualitative study: Investigating large subgraphs

5.1 Survey design

As we reported in Section 4, most subgraphs are small in terms of their number of vertices, edges, and commits. For this reason, we showed small examples when discussing our quantitative RQ results. However, we also found subgraphs describing major refactoring operations. Therefore, the goal of this second study is to qualitatively analyze such subgraphs, with the purpose of investigating the motivation behind large refactoring operations performed over time. Specifically, we conducted a survey with the developers responsible for these refactorings. The context of the study consists of nine developers’ feedback about 66 refactoring operations from seven subgraphs. These subgraphs are among the top-1% largest graphs in our dataset, by number of vertices.

5.1.1 Selecting refactoring subgraphs

We started by selecting the top-1% subgraphs by the number of vertices per programming language. In this way, for Java, we picked subgraphs with at least seven vertices, resulting in 132 instances. In the case of JavaScript, the top-1% refer to 27 subgraphs with at least six vertices. For both languages, we ordered the subgraphs by the number of vertices and we executed the following steps for each one:

  1. We identified the authors of the commits associated with the subgraph. If one of the developers selected in this step was previously contacted, we also discarded her. Our goal is to avoid sending more than one email per developer, reducing the perception of our survey as spam.

  2. In this last step, we manually inspected the selected subgraphs to confirm whether the edges and vertices refer to true positive operations. As a result, we cleaned the subgraphs by removing false positive edges. Lastly, after those filtering steps, we contacted the authors.
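The selection procedure can be sketched as follows. This is an illustration under stated assumptions rather than the scripts used in the study: subgraphs are modeled as dictionaries with hypothetical keys `id`, `vertices`, and `authors`; the top-1% cut is assumed to be a vertex-count threshold that keeps ties; and the manual inspection step is not modeled.

```python
import math


def top_percent_by_vertices(subgraphs, percent=1.0):
    """Return the largest subgraphs by vertex count, keeping ties at the cutoff."""
    ranked = sorted(subgraphs, key=lambda g: len(g["vertices"]), reverse=True)
    if not ranked:
        return []
    cutoff_rank = max(1, math.ceil(len(ranked) * percent / 100))
    threshold = len(ranked[cutoff_rank - 1]["vertices"])
    return [g for g in ranked if len(g["vertices"]) >= threshold]


def authors_to_contact(selected, contacted=()):
    """Pair each subgraph with the authors not contacted before (one email per developer)."""
    seen = set(contacted)
    plan = []
    for graph in selected:
        fresh = []
        for author in graph["authors"]:
            if author not in seen:
                seen.add(author)
                fresh.append(author)
        if fresh:
            plan.append((graph["id"], fresh))
    return plan


# Toy dataset: 100 subgraphs, three of them large (made-up authors).
toy = [{"id": i, "vertices": list(range(2)), "authors": ["d9@example.org"]} for i in range(100)]
toy[0] = {"id": 0, "vertices": list(range(9)), "authors": ["d1@example.org"]}
toy[1] = {"id": 1, "vertices": list(range(7)), "authors": ["d1@example.org", "d2@example.org"]}
toy[2] = {"id": 2, "vertices": list(range(7)), "authors": ["d2@example.org"]}

selected = top_percent_by_vertices(toy, percent=2)
print([g["id"] for g in selected])
print(authors_to_contact(selected))
```

Keeping ties at the threshold is one possible reading of how a top-1% cut yields every subgraph with at least a given number of vertices; a strict rank-based cut would be an equally simple variant.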

We manually inspected 50 subgraphs (33 in Java and 17 in JavaScript), comprising 16 distinct projects.Footnote 40 In Java, the 33 subgraphs refer to 557 refactorings, which were detected by RefDiff in 120 commits, as shown in Table 15. Overall, the tool presents a high precision: 486 out of 557 (87%) refactorings are true positives. For instance, the precision for extract and move method is 93%, which is the most frequent refactoring operation (243 occurrences).

Table 15 Precision (Java)

5.1.2 Contacting developers

From July to August 2020, we sent emails to 62 developers asking for the motivations behind the refactoring subgraphs (see the template in Fig. 30). In the emails, we added a short description of our research goals and a screenshot of the subgraph they are responsible for. We also implemented a web app to navigate the graph structures, i.e., by using this app, our survey participants could check the vertex names, edges, and commits. Therefore, we included a link to the surveyed subgraphs in the survey message, as in the following example from Elasticsearch: https://refactoring-graph.github.io/#/elastic/elasticsearch/713

Fig. 30

Email sent to the authors of refactoring subgraphs

We followed the same steps in JavaScript by inspecting 133 refactoring operations in 60 distinct commits, as presented in Table 16. We notice that the overall precision is also high (93%). For instance, the most common refactoring operation is extract function (79 occurrences), whose precision is 97%.

Table 16 Precision (JavaScript)

Table 17 summarizes the numbers and statistics about this qualitative study, as previously described in this section. We received nine answers, which represents a response ratio of 15%. Each of them corresponds to the developer’s motivation to perform a set of refactorings. In a single case, the developer did not remember the motivation to perform the refactorings because it involved old commits. Overall, the answers are from relevant open-source developers. For example, we received replies from developers working at VMware, Elasticsearch, and Square. Besides, seven developers are among the top-10 contributors in the studied systems. In summary, our qualitative study contains answers about 66 refactoring instances represented in seven refactoring subgraphs. We used labels D1 to D9 to designate the developers and their responses and labels G1 to G7 to indicate the subgraphs.

Table 17 Numbers of the qualitative study

5.2 Survey results

As presented in Table 18, the survey answers suggest two major reasons behind large refactoring subgraphs. In the following paragraphs, we explain and provide examples for each motivation.

Table 18 Reasons to perform large refactoring subgraphs

Improve code design

With 30 edges and two subgraphs, this category was inspired by a recent theme proposed in the literature (Pantiuchina et al. 2020). Essentially, it groups large refactoring operations to improve maintainability or encapsulation. As examples, we have the following answers from two authors of the same subgraph, which is shown in Fig. 31.Footnote 41

Fig. 31

Example of a large subgraph from Request (G1, JavaScript)

In the first answer, D2 performed two refactoring operations by extracting a function and moving it to a distinct file. Similarly, D3 also moved a function. In their answers, the developers emphasized their major motivation was to improve the code design:

“Specifically in the case of [Function Name] all of the code was in a single file. The first step toward making it more maintainable is by reducing scope, also known as encapsulation. (...) I moved [Function Name] out, and a bunch of other functions into separate modules in order to reduce scope, or at least try to minimize it (...)” (D2, 2 refactorings in subgraph G1)

“It was a large file. It is easier to maintain by separating in several components (...)” (D3, 1 refactoring in subgraph G1)

Figure 32 shows a second example in this category.Footnote 42 In this case, the author of one move operation and 26 extract and move operations points out that the major reason was to migrate parts of the code to the appropriate container:

“Most of the refactorings here move code that’s logically related to also be physically related.” (D6, 27 refactorings in subgraph G4)

Fig. 32

Example of a large subgraph from Square Okhttp (G4, Java)

Fix bugs or improve existing features

In five answers (56%), developers essentially mention opportunistic refactorings performed during changes to fix bugs or improve features, which are also reported in a recent study (Paixao et al. 2020). This category includes 35 refactorings located in five distinct subgraphs. As a first example, we show an answer related to several extract and move operations performed to create two methods, as represented in the subgraph in Fig. 33.Footnote 43 D5 explains that his motivation was to improve the usage of the event subscription feature:

“I did those to make sure that empty/error cases use the right objects and call the right methods everywhere they are needed. In addition, they would now indicate in the original method that there are no extra actions intended to be performed on those code paths.” (D5, 15 refactorings in subgraph G3)

Fig. 33

Example of a large subgraph from RxJava (G3, Java)

D8 also points to the maintainability of a feature by pushing down a method to nine subclasses, as presented in Fig. 34.Footnote 44 In this example, the goal is to support a non-mutable communication option:

“We have a concept in [Project Name] used for reading/writing objects when forming requests/responses for inter-node communication. That concept originally depended on using default constructors, with mutable members (...) In order to allow non mutable state in these requests/responses, we changed this model (...) I found there were many layers at the top of the hierarchy of classes that were no longer needed (...) The change referenced here was to remove the [Method Name] from base classes that no longer contained any logic.” (D8, 9 refactorings in subgraph G6)

Fig. 34

Example of a large subgraph from Elasticsearch (G6, Java)

As a last example involving fixing an existing thread-related bug, we show D7’s answer. In this case, the developer performed the refactorings to provide a safe mode to instantiate a class, generating the subgraph in Fig. 35:Footnote 45

“We pushed everything from the front-facing API class (...) that enabled us to call the existing [Class Name] thread safe because each use of it would now create and use a new instance (...) Prior to the change if two threads had the same [Class Name] and called parse at the same time, I think it would get into a mess.” (D7, 6 refactorings in subgraph G5)

Fig. 35

Example of a large subgraph from Spring Framework (G5, Java)

Finally, in two answers, the motivation is also related to fixing bugs:

“(...) I centralized some repeated code around timeouts and fixed a bug where it wasn’t cleared properly.” (D1, 3 refactorings in subgraph G1)

“I was doing closure elimination and memory leakage fix in the two refactoring (...)” (D4, 2 refactorings in subgraph G2)

6 Discussion and implications

Refactoring over time & programming languages

In this paper, we analyzed refactoring graphs in two different programming languages: JavaScript and Java. These languages have distinct styles. Java is a strongly-typed and object-oriented programming language, while JavaScript is an interpreted and dynamic language. Despite their distinct properties, our results regarding refactoring operations over time are similar in both languages, as summarized in Table 19. For example, in both languages, most subgraphs are small (RQ1) and heterogeneous (RQ4). On the other hand, there is a significant variation in the absolute number of detected refactoring subgraphs. We found 1,198 subgraphs over time in Java and 327 subgraphs in JavaScript. However, considering the relative rate, the results remain similar (13% in Java, 15% in JavaScript).

Table 19 Summary of refactoring graphs properties

Detecting refactorings over time

Several tools and techniques are proposed in the literature to detect refactoring operations, such as Refactoring Crawler (Dig et al. 2006), RefFinder (Kim et al. 2010), Refactoring Miner (Tsantalis et al. 2013; Silva et al. 2016), and, more recently, RefDiff (Silva and Valente 2017) and RMiner (Tsantalis et al. 2018; Tsantalis et al. 2020). In common, those approaches only detect atomic refactorings, i.e., operations that happen in a single commit and are performed by a single developer. However, as presented in Section 4, there is a significant rate of refactoring operations spreading over multiple commits (RQ0). In contrast, our approach, refactoring graphs, focuses on the detection of refactorings over time, i.e., operations spanning multiple commits and possibly performed by multiple developers. Moreover, unlike batch refactoring approaches (Murphy-Hill et al. 2009; Bibiano et al. 2019; Cedrim 2018), our approach is constrained neither by the number of developers nor by a time window. Indeed, we found refactoring subgraphs with age ranging from weeks to months (RQ3) and created by multiple developers (RQ5). Therefore, we contribute to the refactoring literature with a novel approach to detect and explore refactoring operations in a broader perspective to complement existing tools and techniques. In addition, these tools do not cluster refactoring operations performed in multiple steps. For example, suppose a developer extracted class Foo from class Bar in commit C1. In this case, the tool used in this paper detects an Extract Class, since the refactoring generates a new entity. However, if she keeps moving methods from Bar to Foo in the next commits, the tool does not group these operations. Instead, it reports them as isolated move operations. Therefore, we also envision studies on new strategies to cluster or group related refactorings performed in multiple steps.
Besides, it would be interesting to evaluate the impact of such “missing” operations in the results and findings of previous empirical studies that relied on atomic refactoring detection tools (Bibiano et al. 2019; Sousa et al. 2020; Hora et al. 2018; Paixao et al. 2020; AlOmar et al. 2021; Vassallo et al. 2019; Brito et al. 2018).

Refactoring comprehension and improvement

When performing code review, developers often adopt diff tools to better understand code changes and decide whether to accept them. In this process, developers may also look for defects and code improvement opportunities (Bacchelli and Bird 2013). However, if the reviewed change is large and complex, this task becomes challenging (Bacchelli and Bird 2013). To alleviate this issue, refactoring-aware code review tools were proposed (Hayashi et al. 2013; Ge et al. 2014; Ge et al. 2017; Brito and Valente 2021) to better understand changes mixed with refactorings. Refactoring graphs can help address this issue by providing navigability at the method level. That is, a code reviewer may navigate back through the history of a method to reason about how a similar change was performed. For example, in Fig. 22, a code reviewer may investigate whether all methods were properly renamed in the past, before accepting commit C3. Thus, refactoring graphs can be integrated into code review tools to better support code understanding and improvement.

Detecting refactoring patterns and smells

In our qualitative study, we investigated subgraphs describing large refactoring operations (RQ1). As we can notice, these subgraphs may represent the improvement of pieces of code. For instance, Fig. 32 shows a large subgraph from our dataset. Among the refactoring instances, there are 21 extract method operations, generating a single method with two lines of code. This method is represented as a node in the subgraph (at the bottom), which is the node with the highest in-degree, i.e., the highest number of incoming edges. Therefore, it may indicate a pattern of moving a specific piece of duplicated code to an appropriate container. In addition, there is an interesting question in this context: could the developer extract these two lines from another part of the project? In other words, should the graph have more edges? In the same way, a high out-degree of a node, i.e., a high number of outgoing edges, can suggest an anomaly in a method. For example, Fig. 17 shows a subgraph with four extract operations from a single method. In this case, it is probably a frequent behavior during a method's evolution, since in RQ6 we identify refactoring graph patterns that are formed by three extract operations (Fig. 28). However, a method that is decomposed several times over time (i.e., one with a high out-degree) can reveal a code design problem. Thus, refactoring graphs can foster the detection of refactoring anomalies over time and drive a future research agenda on refactoring patterns.
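The degree-based reasoning above can be illustrated with a short Python sketch. It is an illustration only: the edge representation, the function names, the thresholds `min_in` and `min_out`, and the toy method names are assumptions, not values derived from our dataset.

```python
from collections import Counter


def degrees(edges):
    """Return (in_degree, out_degree) counters for a refactoring subgraph.

    `edges` is a hypothetical list of (source, target, refactoring_type) tuples.
    """
    in_degree, out_degree = Counter(), Counter()
    for src, dst, _kind in edges:
        out_degree[src] += 1
        in_degree[dst] += 1
    return in_degree, out_degree


def degree_hotspots(edges, min_in=3, min_out=3):
    """List vertices whose in-degree or out-degree reaches the given thresholds."""
    in_degree, out_degree = degrees(edges)
    return {
        "high_in": sorted(v for v, d in in_degree.items() if d >= min_in),
        "high_out": sorted(v for v, d in out_degree.items() if d >= min_out),
    }


# Toy subgraph: three callers share one extracted helper (high in-degree),
# and one method is decomposed into three others (high out-degree).
toy_edges = [
    ("a()", "shared()", "extract"),
    ("b()", "shared()", "extract"),
    ("c()", "shared()", "extract"),
    ("big()", "part1()", "extract"),
    ("big()", "part2()", "extract"),
    ("big()", "part3()", "extract"),
]

print(degree_hotspots(toy_edges))
```

A vertex listed under `high_in` corresponds to the duplicated-code scenario, while one under `high_out` corresponds to a method that is decomposed repeatedly.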

Understanding and assessing software evolution

During software evolution, developers often perform refactoring operations. Consequently, the link between methods may be lost (Hora et al. 2018). For example, if a method a() is renamed to b() and then extracted to c(), it becomes quite hard to trace a() to c(), and vice versa. This has several implications for software evolution research, particularly for studies that assess multiple code versions, such as code authorship detection (Avelino et al. 2016; Rahman and Devanbu 2011; Meneely and Williams 2012; Spinellis 2017; Hattori and Lanza 2009), visual support for code evolution (Gómez et al. 2010; 2015), and bug-introducing change detection (Kim et al. 2006; Zimmermann et al. 2006; Rahman et al. 2011; Chen et al. 2014; Ray et al. 2016), to name a few. In practice, these studies often rely on tools provided by Git and SVN, such as git blame and svn blame, which show what revision and author last modified each line of a file. However, this process is sensitive to refactoring operations (Avelino et al. 2016; Hora et al. 2018). As Git and SVN tools cannot track fine-grained refactoring operations, particularly at the method level, these approaches may miss relevant data. For instance, in the aforementioned example, it would not be possible to detect that method c() originated from method a(). Consequently, we would not be able to find the real creator of method c() nor the developer who introduced a bug in c(). As shown in Section 4, most subgraphs are small (RQ1) and have few commits (RQ2), suggesting that the whole history of the elements may contain a few ruptures due to refactoring. However, these ruptures may still have a significant impact on the retrieval of source code changes (Grund et al. 2021; Hora et al. 2018). With refactoring graphs, we are able to resolve method names over time; thus, software evolution studies can benefit, as more precise tools can be built on top of them.
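The a(), b(), c() example above can be expressed as a backward traversal of a refactoring graph. The following Python sketch is illustrative only: the edge representation and the function names are assumptions, and a real tool would also need the commit associated with each edge.

```python
from collections import defaultdict


def ancestors(edges, method):
    """Return every method from which `method` descends through refactoring edges.

    `edges` is a hypothetical list of (source, target, refactoring_type) tuples.
    """
    parents = defaultdict(set)
    for src, dst, _kind in edges:
        parents[dst].add(src)
    seen, stack = set(), [method]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:  # guards against cycles such as rename and rename back
                seen.add(parent)
                stack.append(parent)
    return seen


def origins(edges, method):
    """Return the ancestors of `method` that have no incoming refactoring edge."""
    has_parent = {dst for _src, dst, _kind in edges}
    return {m for m in ancestors(edges, method) if m not in has_parent}


# a() is renamed to b(), and c() is later extracted from b().
history = [("a()", "b()", "rename"), ("b()", "c()", "extract")]

print(sorted(ancestors(history, "c()")))
print(sorted(origins(history, "c()")))
```

In an authorship or bug-introduction analysis, the methods returned by `origins` are the ones whose history would have to be consulted in addition to that of c().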

7 Threats to validity

Generalization of the results

We analyzed 1,525 refactoring subgraphs from 20 popular and open-source Java and JavaScript systems. Therefore, our dataset is built over credible and real-world software systems. Our qualitative study reinforces recent results about motivations to refactor source code (Paixao et al. 2020; Silva et al. 2016; Pantiuchina et al. 2020), which were reported in other contexts. Also, the motivations are based on answers from relevant contributors to the open-source community. Despite these observations, our findings (as usual in empirical software engineering) may not be directly generalized to other systems, particularly commercial and closed-source ones, and those implemented in languages other than Java and JavaScript. Finally, we focus our study on eight refactorings at method level (Java) and eight refactorings at function level (JavaScript). Thus, other refactoring types can affect the size of subgraphs. We plan to extend this research to cover software systems implemented in other programming languages and refactorings at class level.

Adoption of RefDiff

We adopted RefDiff to detect refactoring operations because it is the sole refactoring detection tool that is multi-language, working for Java, JavaScript, C, and Go (Brito and Valente 2020; Silva et al. 2021). It is also extensible to other programming languages. In our first study (Brito et al. 2020), we concentrated on Java systems. In this second study, we include refactoring subgraphs in JavaScript. Thus, as we planned to extend this research to cover programming languages other than Java, RefDiff was the proper solution. Besides, despite being multi-language, RefDiff's accuracy is quite high. For example, in the current version (Silva et al. 2021), the authors provide an evaluation of the tool for three languages: Java (precision: 96.4%; recall: 80.4%), JavaScript (precision: 91%; recall: 88%), and C (precision: 88%; recall: 91%). The recent evaluation for Go reports 92% of precision and 80% of recall (Brito and Valente 2020). In our dataset, the tool also presents a high precision for Java (557 refactoring instances; precision: 87%) and JavaScript (133 refactoring instances; precision: 93%). Recently, Tsantalis et al. (2018, 2020) proposed the refactoring detection tool RefactoringMiner. In the current version (Tsantalis et al. 2020), RefactoringMiner has a precision of 99.6% and recall of 94%, improving on RefDiff’s overall accuracy. However, RefactoringMiner works only for Java projects. Finally, RefDiff detects refactorings using a generic data structure called Code Structure Tree (CST). The generation of this data structure for JavaScript relies on a simplified call graph due to the dynamic nature of the language. This might result in a higher rate of false negatives. However, the authors mention the tool “works well even when the information encoded in the CST is not completely precise” (Silva et al. 2021).

Building refactoring graphs

When creating the refactoring graphs, we cleaned up our data (i.e., vertices and edges) to keep only meaningful subgraphs. For instance, in Java, we removed constructor methods (vertices) from our analysis because they include mostly initialization settings and do not behave like conventional methods. In JavaScript, we removed refactorings in anonymous functions, i.e., functions without a name, since a name is necessary to generate the vertices in the refactoring subgraphs. We also removed some very specific cases of refactoring (edges) in which RefDiff reported operations in the same element. However, these cases are not likely to affect our results because they only represent a fraction of the refactoring operations. For example, RefDiff detected 89% of the removed operations in anonymous functions in only two systems (Facebook React, 85 occurrences; Hexo, 82 occurrences). Finally, the refactoring subgraphs can include unintentional operations (e.g., reverted commits by automatic deployment systems). To mitigate this threat, we focus our study on the main branch evolution to avoid experimental or unstable versions. Additionally, our results can miss refactoring operations that have not been merged into the main branch. However, as mentioned in previous studies (Hora et al. 2018), this strategy provides a safe overview of the system, avoiding refactorings performed in experimental code. Also, the qualitative study confirmed the selected branches are active ones. For example, developers mentioned large refactoring operations to implement features or improve code design in commits from these branches.

Detection of developers

In RQ5, we investigate the number of developers per refactoring subgraph. We used the email available in git log to identify the author of each commit. Thus, our results can count, for example, the same developer committing with different email addresses as distinct developers. Nevertheless, we found that most subgraphs are created by a single developer.

Large refactoring graphs motivations

In the qualitative study, the refactoring subgraphs were manually inspected by the paper’s first author. Although this inspection might be an error-prone task, it was carefully performed over about a month. Furthermore, we did not receive complaints from the survey participants about false positives that were not detected in this analysis. Our analysis is also publicly available.Footnote 46

8 Related work

8.1 Studies on refactoring evolution

Refactoring is a common practice during software evolution and maintenance. Developers constantly refactor the source code for different purposes (Silva et al. 2016; Wang 2009; Pantiuchina et al. 2020). For this reason, several studies concentrate on this research field (Murphy-Hill et al. 2009; Bibiano et al. 2019; Lin et al. 2019; Dig et al. 2006; Kim et al. 2014; Kim et al. 2016; Szóke et al. 2016; Bavota et al. 2015; Bavota et al. 2012; Dig and Johnson 2005; Shen et al. 2019; Terra et al. 2018; Alves et al. 2014; Lin et al. 2016; Chaparro et al. 2014; Hora and Robbes 2020). Among those, some studies focus on assessing sets of related refactorings. Specifically, these studies analyze batch refactorings (Murphy-Hill et al. 2009; Bibiano et al. 2019; Fernandes 2019; Tenorio et al. 2019; Fernandes et al. 2019; Cedrim 2018). Murphy-Hill et al. (2009) analyzed four datasets from different sources, all of them including metadata about the usage of the Eclipse IDE. For instance, the dataset named Everyone contains Eclipse refactoring commands used by developers. Based on these datasets, the authors discuss usage and configurations of refactoring tools, frequency of refactoring operations, and commit messages. They also investigated refactoring operations executed within 60 seconds of each other, which are named batches. The authors state that some refactoring types are more common in batches, such as rename, introduce a parameter, and encapsulate field. Besides that, about 47% of refactorings performed using a refactoring tool happen in batches. However, the batches involve a short period: the study does not investigate refactoring operations that occur at different moments over time.

In another context, Bibiano et al. (2019) point out that sets of related refactorings can solve problems due to code smells. The authors studied 54 GitHub projects and three closed systems. First, they used the RMiner tool to detect 13 well-known refactorings (Tsantalis et al. 2018), resulting in 24,893 operations. Then, the authors applied a heuristic to compute batch refactorings, i.e., sets of related refactorings (Cedrim 2018). The heuristic includes two main requirements to retrieve a batch refactoring: (i) there are more than two refactoring operations in a single entity and (ii) the operations are from a single developer. This step results in 4,607 batch refactorings. Next, the authors used another tool and scripts to identify more than 41K code smell occurrences in these systems. Finally, the authors computed the effect of batch refactorings on the removal of code smells. The main results show that most batches have only one commit (93%) and two refactoring types. Also, the authors state that batches have a negative or neutral effect on code smells (81%). However, the authors focus on code smells and operations performed by a single developer. In our study, the subgraphs involve refactorings over time (i.e., more than one commit), including subgraphs by multiple developers and different code elements. A second study reuses the heuristic proposed by Bibiano et al. (2019) and introduces two new ones (Sousa et al. 2020), which are based on refactorings in the same commit and on the scope of the operations. As in our first study (Brito et al. 2020), the authors also discuss refactoring properties such as the number of commits and refactoring types. However, the study focuses on code smells and a single programming language (Java). Other studies also discuss the impact of batches on the elimination of code smells, proposing approaches to reuse or suggest sets of related refactoring operations (Tenorio et al. 2019; Fernandes et al. 2019; Jiau et al. 2013; Bibiano et al. 2020). Thus, they do not focus on related refactoring operations over time.
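
The two requirements of the batch heuristic described above can be sketched as a simple grouping step. This is only an illustration under stated assumptions, not the implementation of Cedrim (2018): the `Refactoring` record, its fields, and the function name are hypothetical, and "more than two operations" is read as a minimum of three.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Refactoring(NamedTuple):
    entity: str      # refactored method or class (hypothetical identifier)
    developer: str   # author of the operation
    kind: str        # refactoring type, e.g. "rename method"
    commit: str      # commit in which the operation was detected


def retrieve_batches(
    operations: List[Refactoring], min_size: int = 3
) -> Dict[Tuple[str, str], List[Refactoring]]:
    """Group operations by (entity, developer) and keep groups that have
    more than two operations, i.e. at least `min_size` (assumed to be 3)."""
    groups: Dict[Tuple[str, str], List[Refactoring]] = defaultdict(list)
    for op in operations:
        groups[(op.entity, op.developer)].append(op)
    return {key: ops for key, ops in groups.items() if len(ops) >= min_size}
```

Because the grouping key includes the developer and a single entity, operations by different developers or on different code elements never end up in the same batch, which is the main difference from refactoring subgraphs.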

In his seminal book on refactoring, Fowler (1999) dedicates a chapter (co-authored with Kent Beck) to a similar term called big refactoring. The author points out that most refactorings are atomic, i.e., they are finished in a few minutes. By contrast, big refactorings are performed over months or years. However, in Fowler’s book, such refactorings are discussed in the context of large modularization efforts performed to improve the architecture of a system.

Hora et al. (2018) analyze untracked changes during software development. The authors show that refactorings invalidate several tracking strategies to evaluate system evolution. As in our study, they represent evolutionary changes as graphs. In this case, each node refers to a class or a method, and the edges indicate tracked changes (i.e., entities that keep their names after a modification) and untracked changes (i.e., entities that change their names after a refactoring). That is, a graph represents traceable changes or alterations that split the entity’s history. The results indicate that up to 21% of the changes at the method level and up to 15% at the class level are untraceable. By contrast, in our study, the goal is to investigate refactorings performed over long time windows; we do not concentrate on tracked modifications to source code.

Meananeatra (2012) also reports changes during software evolution as graphs. However, the study concentrates on refactoring sequences to remove long methods. The author proposes an approach based on two main criteria to detect an optimal set of refactorings. An optimal refactoring sequence centers on four metrics: number of removed bad smells, size of the refactoring sequence, number of the affected code elements, and the maintainability value (i.e., analyzability, changeability, stability, and testability). The technique represents candidate refactoring sequences as graphs. In this case, a graph contains a root node representing the original method version with smells. Each new node denotes a new method version after a refactoring operation. As in our study, the edges refer to refactorings. By contrast, the nodes represent the same method before and after the changes. Each path in the graph is a candidate refactoring sequence, which can meet the selection criteria. Thus, the study does not focus on real refactorings over time. Instead, the graph model represents steps to decompose a long method.
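
As an illustration of this graph model, the sketch below enumerates the candidate refactoring sequences as the root-to-leaf paths of a small graph of method versions. It is not Meananeatra's implementation: the adjacency representation, the version names, and the function name are hypothetical, and the graph is assumed to be acyclic.

```python
from typing import Dict, List, Tuple

# Maps a method version to its outgoing edges: (refactoring label, next version).
Graph = Dict[str, List[Tuple[str, str]]]


def candidate_sequences(graph: Graph, root: str) -> List[List[str]]:
    """Return the refactoring labels along every root-to-leaf path.

    Assumption: the graph is acyclic, so the recursion terminates.
    """
    sequences: List[List[str]] = []

    def walk(version: str, path: List[str]) -> None:
        successors = graph.get(version, [])
        if not successors:
            if path:                      # ignore a root without refactorings
                sequences.append(path)
            return
        for label, next_version in successors:
            walk(next_version, path + [label])

    walk(root, [])
    return sequences
```

Each returned sequence could then be ranked by the selection criteria mentioned above, for example by its size.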

8.2 Studies on refactoring comprehension

The literature includes several studies on refactoring comprehension. In this case, the goal involves understanding refactoring activities by investigating, for example, benefits and challenges (Kim et al. 2012; 2014), merge conflicts (Mahmoudi et al. 2019), motivations to refactor source code (Wang 2009; Silva et al. 2016; Pantiuchina et al. 2020; Peruma et al. 2018), association with technical debt (Iammarino et al. 2019), and refactoring opportunities (Catolino et al. 2020).

Silva et al. (2016) performed firehouse interviews to understand the reasons behind refactoring operations in GitHub projects. Based on 195 developers’ answers, the authors found 44 reasons to refactor methods and attributes in Java. As in our study, the authors contacted GitHub developers by email and used thematic analysis to examine the responses (Cruzes and Dyba 2011). Five refactoring types are also in our study: extract method, move method, inline method, pull up method, and push down method. Besides that, there are related motivations in our category improve code design (e.g., the movement of elements to an appropriate container). However, in our research, we investigate sets of refactoring operations that generate large subgraphs in Java and JavaScript systems. This is different from the mentioned study, which focuses on motivations behind refactorings performed in a single commit and in Java projects. That is, in this study, we explore another perspective, centering on a large set of refactoring activities over time in distinct software ecosystems.

A recent study also assesses motivations behind refactoring instances (Pantiuchina et al. 2020). The authors conducted quantitative and qualitative research on a large scale by analyzing refactoring activities in 150 GitHub projects. In the quantitative part, the authors discuss metrics involving code quality (e.g., number of elements, the coupling between classes), code smells, and process-related factors (e.g., number of commits in releases, number of fixed bugs). The qualitative results extend the catalog proposed by Silva et al. (2016), adding 26 new motivations. The motivations are based on discussions in 551 pull requests, as well as comments in the related commits. Our category improve code design is inspired by a core theme proposed by this research, involving the improvement of encapsulation and maintainability. Besides, our category “fix bugs or improve existing features” also incorporates another theme, which is called “Prevent Bugs”. Interestingly, the authors’ main findings point out that, in 52% of the cases, the discussions do not focus on a particular refactoring, i.e., the developers mention a combination of refactoring operations. However, the study focuses only on operations mentioned in pull requests and on Java projects.

Lastly, the improvement of existing features is also reported in a recent study about refactoring operations in the code review process (Paixao et al. 2020). Similar to our results and previous research (Silva et al. 2016; Pantiuchina et al. 2020; Palomba et al. 2017), the authors mention the occurrence of refactoring operations associated with feature maintenance or bug fixing. The authors also reinforce the idea that refactoring is not an isolated operation by investigating sequences in code reviews. Their main findings point to extract method operations occurring together with other refactoring types in the Java ecosystem. In RQ6, we used the gSpan algorithm to investigate refactoring patterns in the subgraphs (Yan and Han 2002). In contrast to their findings, the most recurrent pattern in Java in our study refers to successive rename operations, occurring in 153 subgraphs. Our results also suggest that patterns do not necessarily occur between reviews. That is, refactoring patterns can happen in a single commit, i.e., atomic subgraphs.

9 Conclusion

In this paper, we present refactoring graphs, an approach to assess refactoring operations over time. We analyzed 1,525 refactoring subgraphs from 20 popular systems and two programming languages, Java and JavaScript. We then investigate seven research questions to evaluate the following properties of refactoring graphs: operations over time, size, commits, age, homogeneity, ownership, and patterns. In both languages, the results suggest a similar tendency. We summarize our findings as follows:

  • Approximately 30% of refactoring operations are part of a refactoring subgraph over time.

  • The majority of the refactoring subgraphs are small (four nodes and three edges). However, there are also outliers with dozens of nodes and edges.

  • Most refactoring subgraphs have up to three commits.

  • Refactoring subgraphs span from a few days to months.

  • Refactoring graphs are often heterogeneous, that is, they are composed of several types of refactoring.

  • Refactoring graphs are mostly created by a single developer.
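
The properties summarized above can be computed from a list of refactoring edges, as the following minimal sketch suggests. It is not the tooling used in this paper: the `Edge` record and the function names are hypothetical, and a refactoring subgraph is assumed to be a weakly connected component of the refactoring graph.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, List, NamedTuple


class Edge(NamedTuple):
    source: str      # method before the refactoring
    target: str      # method after the refactoring
    kind: str        # refactoring type, e.g. "rename"
    commit: str
    author: str
    date: datetime


def subgraphs(edges: List[Edge]) -> List[List[Edge]]:
    """Split edges into weakly connected components using union-find."""
    parent: Dict[str, str] = {}

    def find(node: str) -> str:
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for edge in edges:
        parent[find(edge.source)] = find(edge.target)

    components: Dict[str, List[Edge]] = defaultdict(list)
    for edge in edges:
        components[find(edge.source)].append(edge)
    return list(components.values())


def describe(component: List[Edge]) -> dict:
    """Size, commits, age, homogeneity, and ownership of one subgraph."""
    nodes = {n for edge in component for n in (edge.source, edge.target)}
    dates = [edge.date for edge in component]
    return {
        "nodes": len(nodes),
        "edges": len(component),
        "commits": len({edge.commit for edge in component}),
        "age_days": (max(dates) - min(dates)).days,
        "homogeneous": len({edge.kind for edge in component}) == 1,
        "single_owner": len({edge.author for edge in component}) == 1,
    }
```

For instance, the rename of foo() to bar() by d1, followed two months later by the extraction of qux() from bar() by d2, would form one heterogeneous subgraph with three nodes, two edges, two commits, and two developers.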

In the last research question, we mine graph patterns in approximately 9k subgraphs in Java and 2k subgraphs in JavaScript. Our results point to recurring graph patterns over time formed by two edges (e.g., successive rename operations). As a complementary perspective, we also perform a qualitative study with large refactoring subgraphs from our dataset, i.e., subgraphs with several vertices and edges. We contacted the developers, asking for the motivation for their operations. Considering nine developers’ answers, 66 refactoring instances, and seven subgraphs, our results suggest that large refactoring subgraphs are motivated by well-known maintenance activities, involving the improvement of code design, fixing bugs, or the improvement of features. However, it is also important to mention that a single graph may include several such motivations.

Based on our findings, we provided further discussion and implications of our study. Particularly, (i) we discuss our contributions regarding refactoring tools as a novel approach to explore refactoring operations from a broader perspective; (ii) we argue that refactoring graphs can be integrated into code review tools to better support code comprehension; (iii) we claim that refactoring graphs can play a role in the detection of refactoring patterns and anomalies; and (iv) we state the importance of refactoring graphs to resolve method names and support software evolution studies.

Further studies can consider refactoring graphs at the class level; novel approaches to complement existing tools and techniques that focus on atomic refactorings; and also other popular programming languages and ecosystems (e.g., the current RefDiff version also supports C and Go (Brito and Valente 2020; Silva et al. 2021)). Also, we are planning future studies on using refactoring graphs to track changes at the method level. Specifically, we intend to design and implement an Application Programming Interface (API) for incorporating refactoring graphs in software mining and tracing tools (Grund et al. 2021; Higo et al. 2020; da Costa et al. 2017; Neto et al. 2018; Spadini et al. 2018). In such future studies and tools, we also point out possible improvements in the current refactoring graph design, such as an alternative design that handles cycles and different presentation layouts to distinguish the temporal distance between edges.