Keywords

1 Introduction

Finding and locating the same or similar code snippets in source code files is a fundamental activity in software development and maintenance, and various kinds of software engineering tools or IDEs have been proposed and implemented [19].

A (code) clone is a code snippet that has an identical or similar snippet, and a pair of such snippets is called a (code) clone pair [6]. A large body of scientific literature on clone detection has been published and various kinds of code clone detection tools (detectors) have been developed [18, 20]. These code clone detectors are candidates for finding similar code snippets, but most of those are designed to detect all of the code clone pairs in the target, which are generally excessive to the user who wants to search for a specific query snippet.

It has been reported that grep [8], a character-based pattern matching tool, is widely used in the software engineering practice to find lines with a specific keyword [14, 21], although making a query for a code snippet that spans multiple lines needs some skill and effort.

In this paper we propose a tool, named ccgrep (code clone grep), to find code snippets by using the notion of clone detection and pattern matching. Search queries can be simply code snippets, or code snippets enhanced with meta-tokens having a leading $ that can provide flexibility to narrow or broaden the search query. ccgrep is not an ordinary code clone detector that finds all code clone pairs in the target program but it is a code snippet finder that reports code snippets composing code clone pairs against the query snippet.

ccgrep works on Windows or Unix as a simple but reliable clone detector and pattern matching tool for C, C++, Java, and Python. ccgrep has been applied to various applications, and it showed high scalability and performance for large source-code collection. ccgrep is an Open Source Software system and can be obtained from GitHubFootnote 1.

2 Motivating Example

Some uses of the ternary operator (e.g., exp1 ? exp2 : exp3 meaning the result of this entire expression is exp2 if exp1 is true, otherwise the result is exp3—available in C, C++ and Java) are considered bad practice [23]. For example, the use of a < b ? a : b is arguably harder to read than using min(a,b). Therefore, it might be desirable to replace the ternary operator with a function or macro that returns the minimum value. The following is an example found in the file drivers/usb/misc/adutux.c in the Linux kernel (v5.2.0).

figure a

This line of code should be replaced with a more readable expression (note that the macro min in Linux guarantees no side effects):

  • amount = min(bytes_to_read, data_in_secondary);

We might consider that finding all occurrences of such usage of the ternary operator could be done by clone detectors. A popular clone detector NiCad [7] reports 646 block-level clone classes for the drivers/usb files by the default setting, but no snippet with the ternary operator case is included in the result because it is too small to be detectable.

Alternatively, we would try it with grep but it is not easy. For example, simply executing “grep ’<’” for all 598 files (total 51,6394 lines in C) under drivers/usb produces 16335 matching, including many undesired patterns such as “if (a<b)”, “for (i=0; i<x; ...)”, or “#include <linux/...>”. We could narrow the matches by concatenating grep like,

  • grep ’<’ -r . | grep ’?’ | grep ’:’

However, it still produces 149 matches. Perhaps more problematic is that the expressions could span multiple lines. While it is possible to create a complex regular expression to find these expressions, it would be time-consuming and potentially error-prone.

Ideally, we would like to be able to specify a simple and easy-to-create-and-understand query to find these types of snippets. Therefore in this paper, we propose ccgrep and its query is written simply as:

  • a < b ? a : b

In a nutshell, this query specifies that a variable (represented by a) should be followed by < and then the second variable (represented by b), followed by a ?, followed by the same first variable found, followed by :, followed by the second variable. Also, white spaces and comments should be ignored. This query would match all type 2 clones (mentioned in Sect. 3.3) with consistent variable names such as x<y?x:y but it would not match x<y?x:z.

As a practical application, we have used this query to identify 3 instances of such an expression in Linux’s drivers/usb and submitted patches to replace them with min. Two of those patches have been accepted already into Linux.

3 Overview of Code Clone Query by ccgrep

3.1 Basic Features

The input of ccgrep is the query and the target of the source code files in the same programming language. The output is a list of the matched code snippets in the target. The query and the matched code snippets form clone pairs. The query is a code snippet of single or multiple lines and is composed of the regular tokens in the language and the extended tokes with meta symbols having a leading $. We will describe these based on the classification of the clone types. Formalization of the matching is presented in Appendix and also in [11].

3.2 Query for Type 1 Clone

A Type 1 code clone pair is two code snippets possibly with different spacing, line break, or comment. To find type 1 cloned snippets, a code snippet in the programming language is directly given as the query, with a leading $ for each identifier or literal. Note that in the following examples, we will use Java as the programming language.

figure b

In this case, the query matches a code snippet with a comment, but it does not match the latter case of identifier b. The users do not worry about the white spaces and comments in the query.

3.3 Query for Type 2 Clone

A Type 2 code clone pair is two code snippets with the difference of the replacement of identifiers and literals, in addition to the difference of type 1 clones.

In type 2 matching, a user-defined identifier in the query matches any identifier in the target. The same also applies to literal. This “normalization” of the user-defined names allows very flexible pattern matching to find different identifiers or literals. By default, ccgrep executes so-called Parameterized match [3] or P-match for short, such that if two identifiers (or literals) in the query are the same, then the corresponding tokens in the target must be consistently the same. These normalization and p-match are formally explained in Appendix.

figure c

In the former case, a consistently corresponds to y, but in the latter case, it does notFootnote 2.

3.4 Query for Type 3 Clone

A Type 3 code clone pair is two code snippets with a difference of some statements of addition, deletion, or change, in addition to the distinction of type 2. We employ wild-card tokens in the query, which extend the matching from the original seed tokens. The seed snippet and the matched snippet form a code clone pair of type 3. We can replace a token in the seed snippet with ‘$.’ that matches any single token.

figure d

‘$$’ is a wild-card token to match zero or more tokens before the next token matches.

figure e

The following is a more complex example.

figure f

3.5 Finding Various Code Snippets

Combining the regular tokens and meta-tokens in the query, we can find many different kinds of code patterns in the target, from simple to complex ones.

figure g

In addition to finding these patterns, one of the usable use-cases would be a copy-and-paste code search. A developer finds a bug in a system and locates the snippet that causes the defects. She would want to find the same or similar occurrences of the bug in the system, then she copies the buggy snippet and runs ccgrep with the pasted snippet as the query. Then she instantly gets type 2 clone snippets. She does not need to set up a heavy clone detector, nor does she need to do tedious analysis of the unnecessary detection results.

4 Architecture of ccgrep

The architecture of ccgrep is presented in Fig. 1.

Fig. 1.
figure 1

Architecture of ccgrep

  • Tokenizer Generators: Parser generator ANTLR is used to generate two kinds of tokenizers. For the target tokenization, only the language definition is used to recognize the regular tokens, but for the query tokenization, the definition of the meta-tokens and that of regular tokens are used. This process has been executed only once for each target language.

  • Tokenizers: Each tokenizer removes white spaces and comments from the input files and decomposes the code into tokens. The query tokenizer accepts the meta-tokens starting with $ and the regular tokens defined by the language, but the target tokenizer accepts only the regular tokens. The tokenizer for the target files is executed in parallel for each file, along with the following CC Matcher.

  • CC Matcher: This performs a naive sequential pattern matching algorithm between two token sequences for the query (of the length m) and the target (of the length n), whose worst-case time complexity is O(mn) [9]. For type 2 code clone matching, we record the position for each identifier and literal in Map Table to check proper P-matching. The table contents are flushed for each query. Option controls the normalization level, input language, output form, and many others.

  • Output Formatter: This process constructs the output for the successful matching result. Based on the input option, we can view the match result, like grep, in the form of the file name associated with the matched top line as the default, or as many other styles such as full matched lines, only the number of lines, or so on.

ccgrep is written in Java associated with the ANTLR output, and it is very easily installed and executed on various Unix or Windows environments with a single JAR file (about 1M byte) containing all necessary libraries.

5 Evaluation

The goal of the evaluation is to show that our proposed approach can find various kinds of intended code snippets effectively and efficiently. This goal could be decomposed into the following three research questions.

  • RQ1: Query Expressiveness. Are queries for various types of code clones expressible by ccgrep?

  • RQ2: Accuracy of ccgrep. Does ccgrep accurately find various types of code clones already detected by other approaches?

  • RQ3: Performance of ccgrep. What is the execution time of ccgrep? Is the token-based naive sequential pattern matching approach fast enough in practice?

5.1 RQ1: Query Expressiveness

As shown in previous sections, it is obvious that our approach can easily create various query patterns for type 1 matching, type 2 matching with P-match, and type 2 matching with non-P-match, by specifying a code snippet associated with appropriate options. In addition, we can specify the name of an identifier or literal, if we place $ before the name.

A type 3 code-clone snippet is one with a few statement addition, or deletion, or change for a seed snippet. Thus the query for type 3 matching could be made from the seed by adding meta-tokens such as $., $$, or $*, deleting some regular tokens in the seed, or modifying some regular tokens with $., $$, or other meta-tokens.

Therefore, the queries for type 1 to 3 code clones can be effectively created from a code snippet at hand.

5.2 RQ2: Accuracy of ccgrep

For evaluation of query-matching (or information retrieval) systems, recall and precision values, computed by comparing the matched results with the oracles for the queries, are popularly employed [2]. Here in our approach, however, the query to CC matching has no ambiguity and it reports the matching result rigorously as expected and specified by the query with options. In such a sense, the result is always the same as the oracle, i.e., the recall and precision are always 1. Thus, instead of using recall and precision, here we simply investigate if ccgrep works accurately in the sense that code clones already reported by other approaches could be found by our approach.

For this purpose, first, we have employed BigCloneBench [24] which is a huge collection of various kinds of code clones. We have extracted all pairs classified as type 1 and type 2 code clones from BigCloneBench, and for each clone pair (sp1, sp2), we have checked if sp2 is successfully found in the result of ccgrep for sp1 as a query with appropriate options, and vice versa. Table 1 shows the numbers of type 1 and 2 clones found by ccgrep.

Table 1. Checked clones in BigCloneBench

As we can see in Table 1, most type 1 and 2 clones are found accurately. There were several cases of not-found clones, and we have investigated further those cases and recognized that those cases are faults of the classification of BigCloneBench, some of which should be classified into type 3, and some others are not clones. Thus, we can say that all of the proper type 1 and 2 clones in BigCloneBench were perfectly found by ccgrep.

For type 3 clones, since BigCloneBench contains huge type 3 data and we cannot make the queries for those, we have instead used CBCD data [16], that contains 11 type-3 clone sets taken from the source code of Git, the Linux kernel, and PostgreSQL. We have crafted type 3 queries from one of the code snippets in each clone set as the seed and have checked if those queries accurately match the other snippets in the same clone set. We have confirmed that all the crafted queries accurately match other snippets in each clone set.

As far as our investigation, all the matches are controlled by the query and are performed accurately as we have expected.

5.3 RQ3: Performance of ccgrep

It is interesting to know that our approach, i.e., token-based and naive sequential pattern matching, can be implemented fast enough for practical use. We have examined various queries for ccgrep with the target source files of Antlr and Ant in Java, and CBCD data (Git, PostgreSQL, and Linux Kernel) in C, and have measured the performance of ccgrep. Following are employed queries. All execution was made with the default setting of ccgrep except for the language option.

figure h
Table 2. Target and execution result by ccgrep

Table 2 shows the size metrics of the target, the number of found snippets, and the execution time of each query on a workstation with Intel Xeon E5-1603v4 (@2.8 GHz \(\times \) 4), 32 GB RAM, and Windows 10 Pro for WS 64bit.

As we can see from Table 2, the execution times are about 1–10 s even for a few million lines of Linux kernel target. We would think that those are fast and acceptable as a daily-use tool. The execution times for qA to qD are very stable for each target. For example, in the case of Linux, they are about 10 s. even for the small #found case (48 for qA) and the large #found case (187,653 for qC). Thus, we would say that the execution time is not heavily affected by the result size (#found) but mainly affected by the target size (#line). Targets Ant in Java and PgSQL in C have similar sizes around 140–180 Klines, and the execution times are also similar around 1–1.5 s. This would show that the execution time is not strongly affected by the target language.

For comparison to grep we have employed a query qE, that is almost equivalent to qA except qE does not match the targets with more than one line.

figure i

This query is complex and hard to create for inexperienced grep users. It has been executed by grep 3 to 9 times faster than ccgrep, but it missed some expected matches of the code snippets with two or more lines.

As the conclusion of RQ3, although the speed of ccgrep is slower than grep, it is sufficiently fast and acceptable as a search tool even for large targets such as 3 million LOC Linux kernel.

6 Related Works

There are numerous publications on code clone detection methods and their tools [18, 20]. Most of those tools focus on finding all of the code clone pairs in the target file collection. They report all code clones or similar code snippets with similarities higher than a certain threshold. Precisely controlling the matches with meta-symbols like ours cannot be accomplished by those approaches.

There are several tools specialized for finding code snippets. CBCD has been designed for finding related code snippets from a buggy code snippet, by using matching of Program Dependence Graph (PDG) [16]. It can be used to find type 1, 2, and 3 clones; however, the matching generally requires a long pre-processing time to construct PDG, and so this approach would not fit the nimble clone finding that we are interested in. NCDSearch has been designed to find similar code snippets in the pile of source code files for the analysis of code reuse and evolution [12]. The approach would be unique and interesting, but the speed is slower than ours. Micro-clones are recently getting focus due to their importance [4, 13]. Our tool is one of the convenience tools for finding micro clones.

Siamese has been developed for finding code clone pairs for a query method or file using multiple representations of n-gram token sequences with inverted index [17]. It requires a long indexing time (e.g., about 10 min indexing time for 10,000 method target). Thus its application and usage would be different from ours.

Variants of grep such as context grep cgrep, approximate grep agrep, and many others had been proposed and implemented to meet various requirements [1]. However, there is no one for clone-based matching like ours. Semantic-based matching tool sgrep [5], data-structure-based matching tool coccigrep [15], and the logic-based query pattern capturing language [22] were proposed, where the specific notations for the queries are provided without using the notion of clones like ours.

7 Conclusions

We have presented ccgrep that effectively finds code snippets in the target files with the notion of code clone and meta-pattern. It is a practical and effective pattern matching tool, easy-to-use to many software engineers.

As a future direction, we are interested in further performance improvement by using more efficient pattern matching algorithms. Also, we are trying to spread the use of ccgrep to industry collaborators who are trying to detect similar bug patterns in their legacy systems.

Table 3. Token-level matching