1 INTRODUCTION

Software documentation quality issues have been studied since the 1970’s [1] and continue to be addressed today [2]. Moreover, documentation, like software itself, is becoming increasingly complex and requires ever more resources for its development and maintenance.

Copy/paste is commonly used in creating and modifying documents: a fragment of text is copied multiple times, and then edited and expanded to suit the subject. The use of this technique is justified by the fact that many software features described in documentation reuse functionality – this is true for requirements, user interface elements, tests, source code, etc. However, without the aid of specialized tools, repeatedly copied fragments create additional difficulties during maintenance because they require extensive synchronization of changes in the corresponding software features. Software reuse is a considerably more advanced research area than software documentation reuse. There is a multitude of studies on software reuse, some of which have been adopted by the industry (see surveys [3–5]). However, the problem of software documentation reuse largely remains a subject of academic research only [6–10]. We should also mention the problem of documentation unification – if there is a large volume of similar information, it is only reasonable to present it in a consistent manner. Thus, duplicate detection and duplicate examination are important for setting up documentation reuse and unification.

Duplicates in software documentation have been extensively studied during the last decade [6, 11–17]. At the same time, there are no specialized tools for duplicate detection. Generally, various text search tools are used for this purpose. However, these tools cannot be employed to detect near duplicates, i.e. text fragments with a substantial common part and certain variations. For this reason, we have created Duplicate Finder [18, 19]. In [16], we presented a near duplicate search algorithm which considers near duplicates as a combination of exact duplicates. However, the output quality of this algorithm was low because it did not assess the meaningfulness of the found duplicates.

In this paper we present an approach for interactive detection of near duplicates. We involve the user in order to ensure the meaningfulness of the search process. In short, this process is organized as follows. At first, we automatically create a map of exact duplicates of the document using Clone Miner [20].

The next step relies on the following assumption: an accumulation of exact duplicates in a certain place of the document points to a possible emergence of a near duplicate. At this step, the user moves to the most duplicate-populated document section using this map as a clue. After that, they select the text fragment that contains the most frequently used exact duplicates. Then, the user transforms this fragment into a full description of a certain software feature by extending its bounds. This way, the user ensures the meaningfulness of the fragment. Next, this description (textual string) is used as a pattern for further search. The user can also edit the search results by filtering out false positives and ensuring the meaningfulness of found fragments by expanding or narrowing their bounds in the document.

This article is organized as follows. In Section 2 we present an overview of existing studies that are similar to the current work, and in Section 3 we describe the approaches, methods, and technologies we have used in our study. Section 4 contains the description of the interactive near duplicate search technique. In Section 5 we propose a new formal definition of a near duplicate, and in Section 6 we present the pattern-based near duplicate search algorithm that forms the basis of the proposed technique. In addition, we formulate the criterion of completeness for the algorithm and prove that the proposed algorithm is complete. In Section 7 we provide complexity estimates for the algorithm, and in Section 8 we demonstrate the results of an experimental evaluation.

2 RELATED WORK

Duplicates in software documentation have been extensively studied during the last decade. Horie et al. [6] consider duplicates in API documentation of Java projects, extending JavaDoc with reuse tools. This approach is expanded with consideration of near duplicates by Nosál’ and Porubän in [12]. Similarly to the approaches presented in [8–10], the authors of the latter paper employ parametrization for defining variative parts of duplicates. However, their study does not consider the task of near duplicate search itself, and their definition of a near duplicate is informal.

In further research [13], Nosál’ and Porubän examine exact duplicates in embedded documentation of several open-source projects, but do not consider near duplicates.

Wingkvist et al. [14] use duplicates to evaluate the quality of documentation, with no consideration of near duplicates.

The paper [11] by Juergens et al. examines duplicates in requirement specifications: the authors analyzed 28 industrial documents, manually filtering and classifying the found duplicates. The meaning of these duplicates was discussed (with emphasis on duplicates that corresponded to code clones). They have also studied the influence of redundancy on the speed of reading and understanding texts. This work does not consider other types of software documentation or near duplicates, although the authors do mention their existence.

Ouzmazis et al. [7] analyze API documentation of several well-known open source projects, classify detected duplicates, and consider the problem of documentation reuse. They do not consider near duplicates; however, they note that such duplicates occur quite often and are important in practice.

Rago et al. [21] present near duplicate search in textual use case descriptions with the use of natural language processing methods. However, they consider a highly specific type of requirement specifications, which is rarely used in practice. Moreover, it is unclear how to apply this method to other types of documentation.

Concluding our overview, we should note that most existing approaches, except [21], employ token-based code clone analysis tools, which seriously complicates near duplicate detection. However, some authors acknowledge the existence and importance of near duplicates in documentation redundancy analysis and documentation reuse [7, 11, 12].

The approach presented in this work is largely based on pattern matching. This problem is well studied and has been solved in a multitude of ways for different contexts. Let us provide a short overview of this problem in the context of text search.

The algorithm proposed by Ukkonen [22] enables efficient matching of approximate occurrences of a pattern in text, but it requires costly preprocessing of the pattern.

Broder [23] describes a method for matching approximate occurrences of a pattern in text using information retrieval methods; we should note that this approach also requires expensive preprocessing of the input data (both document and pattern).

Algorithms presented in [22, 2426] are efficient but quite sophisticated, which complicates their use and modification, as well as proving their formal properties.

Ukkonen’s algorithm [22] is suited for working with an immutable pattern, and Broder’s approach [23] operates on an immutable document; neither of these situations matches our task. Studies by Landau and Vishkin [25] and Myers [26] describe algorithms for detecting text fragments for which the Levenshtein distance [27] does not exceed a pre-defined threshold. We should emphasize the high computational complexity of Levenshtein distance calculation, which, as shown by our experiments, makes this approach unsuitable for duplicate detection. A more detailed review of approximate pattern matching in text can be found in [28]. Using ideas proposed in this research area, we have created our own pattern matching algorithm while adhering to the following requirements:

(i) the algorithm needs to perform near duplicate detection in accordance with our definition of near duplicates;

(ii) we wanted to formally prove a number of properties of this algorithm;

(iii) the run time has to be adequate because the algorithm is run in an interactive mode;

(iv) the algorithm needs to yield as few false positives as possible.

3 BACKGROUND

3.1 Edit Distance

We use edit distance [27] to determine the degree of similarity of two text fragments (strings of text). This distance is essentially the number of string editing operations required to convert one string into another: the fewer operations, the more similar the strings. Definitions of edit distance differ in their admissible operations. In our work, we use the longest common subsequence distance [29, 30], which admits only insertion and deletion of a symbol, due to its suitability for the near duplicate model described below and, consequently, the convenience of further proofs. The authors of [31] prove that this type of edit distance has metric properties. Further on, we will denote the longest common subsequence distance between two strings \({{s}_{1}}\) and \({{s}_{2}}\) as \(d({{s}_{1}},{{s}_{2}})\).

Computing edit distance is a resource-consuming task. The complexity of the algorithm we have selected is estimated [32] as \(\mathcal{O}(|s_1| \cdot |s_2|)\) in the average case. Furthermore, the authors of [33] show that it is impossible to design an algorithm that provides better complexity estimates for the worst case. In this work, we use the difflib library [34], which is included in the Python standard library (its performance-critical parts are implemented in C).
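As an illustration, the longest common subsequence distance can be computed with difflib via the identity \(d(s_1, s_2) = |s_1| + |s_2| - 2\,|\mathrm{LCS}(s_1, s_2)|\). The following is a minimal sketch of our own (the paper does not publish its code); note that SequenceMatcher yields a close approximation of the longest common subsequence rather than a guaranteed optimum.

from difflib import SequenceMatcher

def lcs_distance(s1: str, s2: str) -> int:
    # Longest common subsequence distance: only insertions and deletions
    # are counted, so d(s1, s2) = |s1| + |s2| - 2 * |LCS(s1, s2)|.
    matcher = SequenceMatcher(a=s1, b=s2, autojunk=False)
    common = sum(block.size for block in matcher.get_matching_blocks())
    return len(s1) + len(s2) - 2 * common

# For example, d("alter owner", "alter the owner") = 4:
# four symbols ("the ") have to be inserted.
print(lcs_distance("alter owner", "alter the owner"))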

3.2 Detecting Exact Duplicates with Clone Miner

We employ exact duplicate detection to create a duplicate map, which the user then uses to select a pattern for matching. We have selected Clone Miner [20] for this task, which is a software code clone detection tool. Clone Miner is a token-based tool and does not employ an abstract syntax tree. A token is a stand-alone word (sequence of symbols) in a document, separated from adjacent words by such delimiters as “.”, “,”, “ ”, “(”, “)”, etc. For example, the fragment “FM registers” consists of two tokens. Clone Miner considers text as an ordered collection of tokens and detects duplicate fragments (clones) using algorithms based on suffix trees [35]. We have chosen Clone Miner because it is easily integrated with other tools through a command line interface and supports the Russian language.
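For illustration, a token stream of the kind Clone Miner operates on can be approximated with a few lines of Python (a sketch under our own assumptions about the delimiter set; Clone Miner’s actual tokenizer may differ):

import re

# Split text into tokens, treating punctuation and whitespace as delimiters.
# The delimiter set here is illustrative, not Clone Miner's exact one.
TOKEN_RE = re.compile(r"[^\s.,()\[\]{};:!?\"']+", re.UNICODE)

def tokenize(text: str) -> list[str]:
    return TOKEN_RE.findall(text)

print(tokenize("FM registers"))  # ['FM', 'registers'] – two tokens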

4 THE PROCESS

The general purpose of our process is to ensure meaningfulness of duplicate detection via user interaction. A diagram which describes the workflow of the process is presented in Fig. 1. Let us describe the process in detail.

Fig. 1. Process overview.

Generating a duplicate map. Using Clone Miner [20], all exact duplicate groups in the document are detected. Every token (word) \(t\) is assigned a color from an RGB interval from white to red: \(\text{color}(t) = (h(t)/T_m) \cdot R + (1 - h(t)/T_m) \cdot W\), where \(R = [1,0,0]\), \(W = [1,1,1]\), \(h(t)\) is the cardinality of the largest exact duplicate group that contains \(t\) (further on called the token temperature), and \(T_m\) is the maximum cardinality of an exact duplicate group in the document (the maximum token temperature). The closer a token’s color to red, the “warmer” it is. This metaphorical representation is called a heat map [36], an example of which can be seen in Fig. 2.

Fig. 2. Duplicate map.

The generated heat map provides an overview of the potential near duplicate occurrences. The areas most likely to contain near duplicates are represented by red areas of different hue. Tokens that occur in the reddest areas are repeated the same or roughly the same number of times. Therefore, the probability that these tokens form a meaningful near duplicate is quite high. This way, one may hope to obtain meaningful near duplicates that appear a significant number of times in the document.
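A possible implementation of the coloring described above is sketched below (our own Python illustration; it assumes token temperatures \(h(t)\) have already been obtained from Clone Miner’s exact duplicate groups):

def token_color(h_t: int, t_max: int) -> tuple[float, float, float]:
    # Linear interpolation between white W = (1, 1, 1) and red R = (1, 0, 0):
    # the hotter the token, the redder its color.
    ratio = h_t / t_max if t_max else 0.0
    red, white = (1.0, 0.0, 0.0), (1.0, 1.0, 1.0)
    return tuple(ratio * r + (1.0 - ratio) * w for r, w in zip(red, white))

# A token occurring in an exact duplicate group of maximum cardinality
# gets pure red; a token outside any group stays white.
print(token_color(h_t=10, t_max=10))  # (1.0, 0.0, 0.0)
print(token_color(h_t=0,  t_max=10))  # (1.0, 1.0, 1.0)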

Selecting a search pattern. The user moves to the reddest (the “warmest”) area on the heat map (Fig. 2), zooms in on it, and selects a fragment (pattern) for further search (see Fig. 3). During this process, the user considers not only the color of the selected fragment but also aims to select a fragment that describes a software feature in full. To achieve this, the user can either include a white-colored text fragment in the pattern or leave out a red-colored one. Consider an example. Following the information from Fig. 3, we select the following fragment:

Fig. 3. Selecting a search pattern (PostgreSQL documentation).

“To alter the owner, you must also be a direct or indirect member of the new owning role, and that role must have CREATE privilege on the table’s schema. (These restrictions enforce that altering the owner doesn’t do anything you couldn’t do by dropping and recreating table. However, a superuser can alter ownership of any table anyway.)”

This fragment describes an integral software feature concerning the administration of the PostgreSQL DBMS: to alter the owner of a database table, you must also have specific rights or be an administrator.

Near duplicate search. The user selects a similarity measure for the highlighted fragment, which is a number from \(1/\sqrt 3\) to 1, and launches the pattern matching algorithm.

Forming a near duplicate group. Having received the algorithm’s output, the user modifies it. During this process the user deletes elements (near duplicate occurrences) that resemble the pattern only syntax-wise but not meaning-wise. Furthermore, for each occurrence the user can modify the bounds of the fragment to ensure its meaningfulness.

5 DEFINING A NEAR DUPLICATE GROUP

In this section we generalize the definition of a near duplicate group that we have proposed earlier in [16, 37]. In contrast with the previous definition, here we use a parameter instead of a constant to define the similarity measure, and we allow extension points to be placed at the ends of duplicates. We consider a document \(D\) as a finite sequence of symbols, denoting its length as \({\text{length}}(D)\).

Definition 1. A text fragment is an occurrence of a certain symbol string in document \(D\).

Therefore, for every text fragment \(g\) of document \(D\) there is an integer interval \([b,e]\), where \(b\) is the position of the first symbol of the fragment, and \(e\) is the position of the last. By \(g \in D\), we denote a text fragment \(g\) of document \(D\). Next, let \([g]\) be the function that maps a text fragment \(g\) to its interval, and let \({\text{str}}(g)\) be the function that maps a text fragment \(g\) to its textual content. By \({\text{b(}}g)\) and \({\text{e}}(g)\) we denote the positions of the beginning and the end of \(g\). Next, \({\text{|}}g{\text{|}}\) is a function that takes a text fragment \(g\) and returns its length as \({\text{|}}g{\text{|}} = 1 + {\text{e}}(g) - {\text{b(}}g)\). Finally, we introduce a two-place predicate \({\text{Before}}({{g}_{1}},{{g}_{2}})\), which is true if and only if \({\text{e}}({{g}_{1}}) < {\text{b}}({{g}_{2}})\).

Definition 2 (Near duplicate group). Consider a collection of text fragments \({{g}_{1}}, \ldots ,{{g}_{M}}\) of document \(D\). We will call this collection a near duplicate group with the similarity measure \(k \in (1{\text{/}}\sqrt 3 ,1]\) (or simply a near duplicate group) if the following conditions are satisfied.

(1) \(\forall i \in \{ 1, \ldots ,M - 1\} \) holds \({\text{Before}}({{g}_{i}},{{g}_{{i + 1}}})\);

(2) There exists an ordered collection of strings \((I_1, \ldots, I_N)\) such that there is an occurrence of this collection in every text fragment, i.e. \(\forall j \in \{1, \ldots, M\}\) \(\forall i \in \{1, \ldots, N\}\): \(I_i \subset \text{str}(g_j)\), and \(\forall i \in \{1, \ldots, N-1\}\) \(\text{Before}(I_i^j, I_{i+1}^j)\) holds, where \(I_i^j\) is the occurrence of \(I_i\) in \(g_j\), and the following condition is satisfied:

$$\forall j \in \{ 1, \ldots ,M\} \frac{{\sum\limits_{i = 1}^N {{\text{|}}{{I}_{i}}{\text{|}}} }}{{\left| {{{g}_{j}}} \right|}} \geqslant k.$$

We define the archetype of a given group as the collection of strings \((I_1, \ldots, I_N)\). It is easy to show that the definition proposed above generalizes the definition given in [16, 37]. If \(G\) is a near duplicate group, then by \(|G|\) we denote the number of elements of this group.
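For illustration (the numbers here are our own, not taken from the paper’s corpus): let the archetype consist of strings with total length \(\sum_{i=1}^{N} |I_i| = 80\), and let the group contain two fragments with \(|g_1| = 100\) and \(|g_2| = 95\). Then the ratios in the condition above are \(80/100 = 0.80\) and \(80/95 \approx 0.84\), so \(\{g_1, g_2\}\) forms a near duplicate group for any similarity measure \(k \leqslant 0.80\).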

Definition 3. Consider a text fragment \(p\) of document \(D\) (\(p \in D\)) and \(g \in D\). We say that \(g\) is a near duplicate of \(p\) with similarity \(k\) if \(g\) and \(p\) form a near duplicate group with similarity \(k\) according to Definition 2.

6 PATTERN BASED NEAR DUPLICATE SEARCH ALGORITHM

6.1 Algorithm Description

The algorithm consists of three phases. At phase 1 (scanning), document \(D\) is scanned by a sliding window \(w\) of size \(L_w = \frac{|p|}{k}\) with a one-symbol step. The text fragment that corresponds to the current window position is compared to pattern \(p\) using edit distance, and if they are close, i.e. \(d(p, w) \leqslant k_{\text{di}}\), then this fragment is saved in the set \(W_1\). The threshold value \(k_{\text{di}}\) is defined as follows:

$${{k}_{{{\text{di}}}}} = \left| p \right|\left( {\frac{1}{k} + 1} \right)(1 - {{k}^{2}}).$$
(1)

This choice will be explained below.
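For instance (an illustrative calculation with assumed values): for \(k = 0.8\) and a pattern of \(|p| = 300\) symbols, formula (1) yields \(k_{\text{di}} = 300 \cdot (1/0.8 + 1) \cdot (1 - 0.64) = 300 \cdot 2.25 \cdot 0.36 = 243\), so window positions whose distance to the pattern exceeds 243 are discarded during scanning.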

At phase 2 (“shrinking”), we search for the largest text fragment that is closest to pattern \(p\) within every element of \(W_1\). Essentially, during this phase the lengths of the elements of \(W_1\) decrease, i.e. text fragments are “shrunk.” This is reasonable since the window (and, consequently, every element of \(W_1\)) has the maximum possible size of a near duplicate of \(p\) (see Lemma 1). During “shrinking,” for every \(w_2 \in W_1\) all of its internal fragments are iterated over, starting with fragments of length \(|p| \cdot k\) and up to fragments of length \(\frac{|p|}{k}\). The one that is closest to the pattern in terms of edit distance is selected; if there are several such fragments, the longest one is taken. This phase results in the set \(W_2\).

Algorithm 1: Pattern-based near duplicate search algorithm

Input data: \(D\) – document, \(p\) – pattern, \(k\) – similarity measure
Result: \(R\)

// Phase 1 (scanning)
1:  \(W_1 \leftarrow \varnothing\)
2:  for all \(w_1\): \(w_1 \in D \wedge |w_1| = L_w\) do
3:      if \(d(w_1, p) \leqslant k_{di}\) then
4:          add \(w_1\) to \(W_1\)
// Phase 2 (“shrinking”)
5:  \(W_2 \leftarrow \varnothing\)
6:  for \(w \in W_1\) do
7:      \(w_2' \leftarrow w\)
8:      for \(l \in I\) do
9:          for all \(w_2\): \(w_2 \subseteq w \wedge |w_2| = l\) do
10:             if \(\text{Compare}(w_2, w_2', p)\) then
11:                 \(w_2' \leftarrow w_2\)
12:     add \(w_2'\) to \(W_2\)
// Phase 3 (filtering)
13: \(W_3 \leftarrow \text{Unique}(W_2)\)
14: for \(w_3 \in W_3\) do
15:     if \(\exists w_3' \in W_3: w_3 \subset w_3'\) then
16:         remove \(w_3\) from \(W_3\)
17: \(R \leftarrow W_3\)

At phase 3 (filtering), duplicate elements of \(W_2\) are eliminated. They emerge at the previous phase because \(W_1\) can contain text fragments that differ only by a window shift of several symbols. Furthermore, elements that are fully contained in other elements of \(W_2\) are filtered out. This phase results in the set \(W_3\), which is the output of the algorithm, i.e. the set \(R\).

Let us describe the auxiliary functions used in Algorithm 1. The \({\text{Compare}}\) function is used during phase 2 to identify which of two text fragments is closer to the pattern \(p\) in terms of edit distance. If the distance from both fragments to the pattern is the same, the longer fragment is selected:

$$\text{Compare}(w_1, w_2, p) = \begin{cases} \text{true}, & d(w_1, p) < d(w_2, p) \\ \text{false}, & d(w_1, p) > d(w_2, p) \\ |w_1| > |w_2|, & d(w_1, p) = d(w_2, p) \end{cases}$$

The Unique function receives a collection of text fragments, iterates over it and discards duplicate fragments.
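For readers who prefer code to pseudocode, a straightforward (unoptimized) Python sketch of Algorithm 1 is given below. It is our own illustration: the function and variable names are assumptions, Optimizations 1–5 are omitted, rounding of the window size is our choice, fragments are represented by their \([b, e]\) intervals, and the distance is computed with the difflib-based function from the sketch in Section 3.1.

from difflib import SequenceMatcher

def lcs_distance(s1: str, s2: str) -> int:
    m = SequenceMatcher(a=s1, b=s2, autojunk=False)
    return len(s1) + len(s2) - 2 * sum(b.size for b in m.get_matching_blocks())

def find_near_duplicates(document: str, pattern: str, k: float) -> list[tuple[int, int]]:
    """Unoptimized sketch of Algorithm 1; fragments are (b, e) index pairs."""
    L_w = round(len(pattern) / k)                    # window size |p| / k
    k_di = len(pattern) * (1 / k + 1) * (1 - k * k)  # threshold (1)

    # Phase 1 (scanning): slide a window of size L_w with a one-symbol step.
    W1 = [(b, b + L_w - 1)
          for b in range(len(document) - L_w + 1)
          if lcs_distance(document[b:b + L_w], pattern) <= k_di]

    # Phase 2 ("shrinking"): inside every window keep the internal fragment
    # closest to the pattern; ties are broken in favor of the longer fragment.
    W2 = []
    for b, e in W1:
        best, best_d = (b, e), lcs_distance(document[b:e + 1], pattern)
        for l in range(round(len(pattern) * k), e - b + 2):
            for bb in range(b, e - l + 2):
                d = lcs_distance(document[bb:bb + l], pattern)
                longer = l > best[1] - best[0] + 1
                if d < best_d or (d == best_d and longer):
                    best, best_d = (bb, bb + l - 1), d
        W2.append(best)

    # Phase 3 (filtering): drop exact duplicates and fragments that are
    # contained in other fragments of the result.
    W3 = set(W2)
    return sorted(f for f in W3
                  if not any(g != f and g[0] <= f[0] <= f[1] <= g[1] for g in W3))

On real documents this naive version is prohibitively slow, which is exactly what Optimizations 1–5 in Section 6.3 address.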

6.2 Algorithm Completeness

The criterion of completeness for our pattern-based near duplicate search algorithm is defined as follows. The algorithm is complete if for arbitrary \(D\), \(p \in D\), algorithm output \(R\), and for any near duplicate group \(G\) of fragment \(p\) with similarity \(k\) (see Definition 2), the following condition holds true:

$$\forall g \in G\;\;\exists w \in R{\text{:}}\,\left| {g \cap w} \right| \geqslant {{O}_{{\min}}}(k),$$
(2)

where \(O_{\min}(k) = \frac{|p|}{2}\left(3k - \frac{1}{k}\right)\). This criterion can be explained as follows: for any fragment of document \(D\) that is a near duplicate of pattern \(p\), the set \(R\) contains a text fragment that significantly intersects with this near duplicate, allowing the user to easily recognize the duplicate in the output. The size of the intersection is bounded from below by the function \(O_{\min}(k)\). We have \(O_{\min}\left(\frac{1}{\sqrt 3}\right) = 0\), and \(O_{\min}(k) > 0\) for larger values of \(k\). This holds since the function increases with increasing \(k\): its derivative is \(\frac{|p|}{2}\left(3 + \frac{1}{k^2}\right)\), which is obviously positive for all \(\frac{1}{\sqrt 3} < k \leqslant 1\). In practice, the best results are achieved for \(k \geqslant 0.77\): for these values \(O_{\min}(k) > \frac{|p|}{2}\), i.e. all elements of \(R\) intersect all near duplicates by at least half of the pattern’s length. Note that the lower estimate \(O_{\min}(k)\) is pessimistic: the experimental results demonstrate a larger overlap between the output and the near duplicates contained in the document. Let us move on to the completeness of the proposed algorithm, proving several auxiliary propositions first.

Lemma 1. Let \(G\) be a near duplicate group of fragment \(p\) with similarity \(k\). Then \(\forall g_1, g_2 \in G\): \(k \leqslant \frac{|g_1|}{|g_2|} \leqslant \frac{1}{k}\) holds true.

Proof. Suppose \(A\) is the archetype of group \(G\) and \(|A|\) is the total length of its strings. Then \(k|g_1| \leqslant |A|\) and \(k|g_2| \leqslant |A|\). Because \(A \subset \text{str}(g_1)\) and \(A \subset \text{str}(g_2)\), we have \(k|g_1| \leqslant |A| \leqslant |g_1|\) and \(k|g_2| \leqslant |A| \leqslant |g_2|\). Therefore, \(k|g_1| \leqslant |g_2|\) and \(k|g_2| \leqslant |g_1|\). Dividing the first inequality by \(k|g_2|\) and the second by \(|g_2|\), we get the required result.

Lemma 2. Let \(G\) be a near duplicate group of fragment \(p\) with similarity \(k\). Then \(\forall g \in G\) the following holds true: \(d(g, p) \leqslant (1 - k^2)|p|\).

Proof. Because \(p\) and \(g\) belong to the same near duplicate group, they have the same archetype and can be presented in the following way:

$$\begin{gathered} p = v_{0}^{p}{{I}_{1}}v_{1}^{p}{{I}_{2}} \ldots v_{{N - 1}}^{p}{{I}_{N}}v_{N}^{p}, \\ g = v_{0}^{g}{{I}_{1}}v_{1}^{g}{{I}_{2}} \ldots v_{{N - 1}}^{g}{{I}_{N}}v_{N}^{g}, \\ \end{gathered} $$

where \({{I}_{1}},{{I}_{2}}, \ldots \mathop {,I}\nolimits_N \) is the archetype of group \(G\), \(v_{0}^{p},v_{1}^{p}, \ldots ,v_{N}^{p}\) is the variative part of \(p\), and \(v_{0}^{g},v_{1}^{g}, \ldots ,v_{N}^{g}\) is the variative part of g.

Let us introduce the following notations: \(v^p = v_0^p v_1^p \ldots v_N^p\), \(v^g = v_0^g v_1^g \ldots v_N^g\), \(A = I_1 I_2 \ldots I_N\). Then, according to the condition of Definition 2, we have \(|A|/|p| \geqslant k\), hence \(|p| - |v^p| \geqslant |p|k\), hence \(|p| - |p|k \geqslant |v^p|\), i.e. \(|p|(1 - k) \geqslant |v^p|\), and, likewise, \(|g|(1 - k) \geqslant |v^g|\). Moreover, \(g\) can be obtained from \(p\) by substituting \(v_i^g\) for \(v_i^p\), i.e. \(d(g, p) \leqslant |v^g| + |v^p| \leqslant (1 - k)(|p| + |g|)\). According to Lemma 1 we have \(|g| \leqslant k|p|\). Then \(d(g, p) \leqslant (1 - k)(1 + k)|p| = (1 - k^2)|p|\).

Lemma 3. For any \(p \in D\), \(k \in (1/\sqrt 3, 1]\), and near duplicate group \(G\) of fragment \(p\) with similarity \(k\) (Definition 3), the criterion of completeness (2) is satisfied with respect to the results of phase 1.

Proof. Let \(g \in G\) and let \(fr\) be a text fragment of length \(\frac{|p|}{k}\) that contains \(g\), i.e. \(g \subseteq fr\). As mentioned above, the triangle inequality is satisfied for the longest common subsequence distance: \(d(fr, p) \leqslant d(fr, g) + d(g, p)\). According to Lemma 2, \(d(g, p) \leqslant |p|(1 - k^2)\). Because we can obtain \(g\) from \(fr\) by removing all symbols that belong to \(fr \backslash g\), \(d(fr, g) \leqslant |fr| - |g|\) holds true. But because \(|fr| = \frac{|p|}{k}\) and, according to Lemma 1, \(|g| \geqslant k|p|\), the following also holds true: \(|fr| - |g| \leqslant \frac{1}{k}|p| - k|p|\). Therefore, \(d(fr, p) \leqslant |p|\left(\frac{1}{k} - k + 1 - k^2\right) = |p|\left(1 + \frac{1}{k}\right)(1 - k^2)\). It is obvious that during the scanning at phase 1 there will be a state in which the window coincides with \(fr\). Then, according to (1), \(fr \in W_1\). Therefore, the following holds true:

$$\forall g \in G:\left( {\left| {fr} \right| = \frac{{\left| p \right|}}{k},g \subseteq fr} \right) \Rightarrow fr \in {{W}_{1}}.$$

Since for any near duplicate there is an element of \(W_1\) that not only intersects with this duplicate but contains it completely, criterion (2) is satisfied for \(W_1\) if we consider it as the set \(R\).

Lemma 4. For any \(p \in D\), \(k \in (1/\sqrt 3, 1]\), and near duplicate group \(G\) of fragment \(p\) with similarity \(k\) (Definition 3), the criterion of completeness (2) is satisfied with respect to the output of phase 2.

Proof. We omit a formal proof due to its large size. The main idea here is considering the worst case where during “shrinking,” the length of \({{w}_{1}} \in {{W}_{1}}\) elements is decreased to k|p|. Considering corner cases (an element positioned in the center or right at the ends of \({{w}_{1}}\)) allows us to confirm that the lemma holds true.

Lemma 5. For any \(p \in D\), \(k \in (1/\sqrt 3, 1]\), and near duplicate group \(G\) of fragment \(p\) with similarity \(k\) (Definition 3), the criterion of completeness (2) is satisfied with respect to the output of phase 3.

Proof. Phase 3 consists of element deletion from \({{W}_{2}}\) only. The intervals of the deleted elements are contained in intervals of other elements. It is obvious that if \({{W}_{2}}\) satisfied the criterion, then \({{W}_{3}}\) will satisfy it as well.

Theorem 1. The criterion of completeness is satisfied for any \(p \in D\), \(k \in (1/\sqrt 3, 1]\), the corresponding algorithm output \(R\), and any near duplicate group \(G\) of fragment \(p\) with similarity \(k\).

Proof. The output of phases 1–3 was proven to satisfy criterion (2) in Lemmas 3–5.

6.3 Optimizing the Algorithm

The proposed algorithm turned out to be inadequate performance-wise: its run time exceeded one hour when searching for patterns larger than 100 symbols in documents of about 2 MB in size. Furthermore, the algorithm produced many false positives: its output contained the same text fragments insignificantly shifted relative to each other. As a result, a number of optimizations have been introduced.

Optimization 1 is applied during phase 1 (scanning). It reduces the number of calculations of \(d\), significantly improving the run time of the algorithm. It is based on the well-known Boyer–Moore algorithm for matching a pattern in a string [38]: during the scan, a check is performed to see by how many symbols the window can be shifted without skipping the required result. Specifically, at each step of the scan we check whether \(d(w, p) > k_{\text{di}} + 1\) holds (\(w\) is the current window position). If it is true, then we slide the window by \((d(w, p) - k_{\text{di}})/2\) symbols to the right. Otherwise, we slide it by one symbol.
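A sketch of how this skip rule changes the phase 1 loop is shown below (our own illustration continuing the naive sketch above; d stands for the lcs_distance function):

def scan_with_skips(document: str, pattern: str, L_w: int, k_di: float, d):
    # Phase 1 with Optimization 1: if the current window is far from the
    # pattern, shift it by (d(w, p) - k_di) / 2 symbols instead of one.
    W1, b = [], 0
    while b + L_w <= len(document):
        dist = d(document[b:b + L_w], pattern)
        if dist <= k_di:
            W1.append((b, b + L_w - 1))
            b += 1
        elif dist > k_di + 1:
            b += max(1, int((dist - k_di) / 2))
        else:
            b += 1
        # Rationale: shifting the window by one symbol changes the distance
        # by at most 2, so positions inside the skip cannot fall below k_di.
    return W1

The skip in Optimization 2 follows the same idea, only with \(d_{\min}\) in place of \(k_{\text{di}}\).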

Optimization 2 is applied during phase 2 (“shrinking”). It also reduces the number of computations of \(d\), using an approach similar to the previous optimization. During the “shrinking” of a text fragment \(w_1\), the window scans the fragment symbol by symbol. At each step \(d(p, w_2')\) is computed and, if necessary, its minimum value \(d_{\min}\) is updated. If \(d(p, w_2') > d_{\min} + 1\) holds true for the current window position \(w_2'\), we slide the window to the right by \((d(p, w_2') - d_{\min})/2\) symbols. Otherwise, we slide it by one symbol. The \(d_{\min}\) value is updated at the beginning of each iteration corresponding to the next value of the sliding window width.

Optimization 3 is applied during phase 3 (filtering). It reduces the cardinality of \(W_3\) as follows: the set is partitioned into maximal subsets that are transitively closed under intersection. Then, for every such subset, the fragment \(w_3\) with the minimum value of \(d(w_3, p)\) is selected (or, if there are several such fragments, the longest one), and all remaining elements of the subset are deleted.
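The grouping step of Optimization 3 can be sketched as follows (an illustration under our own assumptions; intervals sorted by start position form a transitively closed overlap group whenever they chain together):

def filter_overlapping(W3, pattern, document, d):
    # Optimization 3: partition W3 into maximal groups of (transitively)
    # overlapping fragments and keep one representative per group --
    # the fragment closest to the pattern, ties broken by length.
    groups, current, cur_end = [], [], -1
    for frag in sorted(W3):
        if current and frag[0] <= cur_end:   # intersects the running group
            current.append(frag)
            cur_end = max(cur_end, frag[1])
        else:
            if current:
                groups.append(current)
            current, cur_end = [frag], frag[1]
    if current:
        groups.append(current)

    def rank(f):  # smaller distance first, then longer fragment
        return (d(document[f[0]:f[1] + 1], pattern), -(f[1] - f[0] + 1))

    return [min(group, key=rank) for group in groups]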

Optimization 4 is applied during phase 3; it extends all text fragments of \(W_3\) to complete words. The bounds of a text fragment can ignore word boundaries, i.e. incomplete words can be included in text fragments. To address this, text fragments are expanded so that these words are included in full. This helps to decrease the number of false positives in the algorithm’s output.

Optimization 5 is applied during phases 1 and 2. Its purpose is to reuse computed values of \(d\) for identical strings and to parallelize the “shrinking” of the elements of \(W_1\).

Let us show how these optimizations affect the algorithm’s completeness.

Theorem 2.Optimizations 1, 2, 4, 5 preserve the completeness property.

Proof. Consider two strings obtained by concatenation: \(s_1 = ab\) and \(s_2 = bc\), where \(|a| = |c|\) and \(d(s_1, s_2) = d\). We can easily show that \(|a| + |c| \geqslant d\). Using this fact, it is easy to prove the completeness of optimization 1; the completeness of optimization 2 is proven in the same way. The completeness of optimization 4 cannot be doubted because it only extends the elements of the output. Finally, optimization 5 is complete because it only concerns the implementation of the algorithm.

Note 1. Situations where optimization 3 does not satisfy the criterion of completeness are possible, but our experiments show that their number is insignificant in practice.

7 ALGORITHM COMPLEXITY

Document length \(\left| D \right|\), pattern length \(\left| p \right|\), the k value, and the cardinality of the near duplicate group of the pattern \(\left| {{{G}_{p}}} \right|\) are all significant parameters that influence the run time of the algorithm. Let us estimate the algorithm’s complexity depending on these parameters.

The average complexity of calculating \(d\) (i.e. edit distance) is \(\mathcal{O}(|p|^2)\) [32]. Consequently, the average complexity of phase 1 is proportional to \(|p|^2\) and \(|D|\). During phase 2 all internal fragments of each \(w_1 \in W_1\) are iterated over, and it is easy to show that their number is proportional to \(|p|\). Furthermore, the cardinality of the set \(W_1\) is proportional to \(|G_p|\) and \(k_{\text{di}}\). Hence, the complexity of phase 2 is \(\mathcal{O}(|G_p|)\) in terms of \(|G_p|\) and \(\mathcal{O}(|p|^4)\) in terms of \(|p|\). The complexity of phase 3 operations is \(\mathcal{O}(|W_2| \cdot \log|W_2|)\), but because \(|W_2| = |W_1|\), the complexity of phase 3 is \(\mathcal{O}(|G_p| \cdot \log|G_p|)\) in terms of \(|G_p|\) and \(\mathcal{O}(|p| \cdot \log|p|)\) in terms of \(|p|\). Optimizations 1 and 2 on average lead to “skips” during iteration whose size is proportional to \(k_{\text{di}}\) (and hence to \(|p|\)), making the complexity of phases 1 and 2 \(\mathcal{O}(|p|)\) and \(\mathcal{O}(|p|^3)\), respectively. Therefore, with \(k = \text{const}\) the algorithm’s run time can be estimated as \(\mathcal{O}(|D|)\) in the document length, \(\mathcal{O}(|p|^3)\) in the pattern length, and \(\mathcal{O}(|G_p| \cdot \log|G_p|)\) in the group cardinality.

Theorem 3. The complexity of the algorithm with fixed D and p is estimated as \(\mathcal{O}(1{\text{/}}{{k}^{4}})\) on average.

We omit the proof due to its large volume.

8 EVALUATION

Theoretical complexity estimates are not sufficient for determining the real run time of the proposed algorithm. These estimates were derived for each significant parameter of the algorithm independently in order to simplify the proofs, while the real complexity can depend on their combinations. Another argument for the necessity of experimental evaluation is the fact that theoretical estimates do not indicate the value ranges these parameters take in practice. Finally, other properties of the algorithm need to be evaluated as well.

We have conducted our experiments to answer the following questions:

(i) what is the run time of the pattern matching algorithm on real data;

(ii) how large the algorithm’s outputs are on real data.

The first question is important because the algorithm is used in interactive mode, and therefore its run time should not exceed several minutes. Considering output volume, we have proven that our algorithm’s output contains all existing near duplicates of a certain pattern. It is, however, unclear how large the real outputs of the algorithm are: outputs that contain over 100 elements become more or less unfeasible for human analysis. In turn, output volume is affected by the number of false positive matches and the number of near duplicates in the document.

We have experimented on 19 industrial documents both in Russian and English (described in [16]). The experiments were conducted on a computer with the following specifications: Intel Core i7 2600, 3.4 GHz, 16 GB RAM. The documents were converted into “flat text” (UTF-8 encoding) with Pandoc [39]. After the conversion, the size of the documents ranged from 0.04 MB to 2.5 MB (0.75 MB on average). We are inclined to think that these numbers are realistic for \({\text{|}}D{\text{|}}\). However, we should note that it is necessary to create a more representative selection of different documentation types in order to obtain more precise estimates.

The experiments were conducted as follows. We have run the algorithm for patterns of length ranging from 50 to 1000 symbols with a 50-symbol step. A 1000-symbol fragment is about 25% of a page of a docx document, i.e. it is a large fragment, and, according to our experiments, duplicates are generally significantly smaller. We have iterated the similarity measure value \(k\) from 0.6 to 1 with a 0.1 step for each selected pattern in each document. We have selected the pattern in the following way. For a fixed pattern length, we followed our technique and selected the “warmest” area of this length in the document, calculating it automatically as the fragment where the expression \(\sum\nolimits_{t \in fr} h(t)\) reaches its maximum value. In this expression \(t\) is a token of fragment \(fr\) and \(h(t)\) is its temperature. The sum is calculated over all tokens of the fragment, including the possibly incomplete leftmost and rightmost tokens.
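The automatic selection of the “warmest” fragment used in the experiments can be expressed as a simple sliding-window maximization (our own sketch; it assumes tokens are given together with their positions and temperatures):

def warmest_fragment(tokens, length):
    # tokens: list of (start, end, temperature) triples, sorted by start.
    # Returns the (start, end) character window of the given length with the
    # maximal total token temperature; a naive O(n_positions * n_tokens) sketch.
    if not tokens:
        return None
    last_pos = tokens[-1][1]
    best, best_heat = (0, length - 1), float("-inf")
    for b in range(0, max(last_pos - length + 2, 1)):
        e = b + length - 1
        # Sum temperatures of all tokens intersecting [b, e], including the
        # possibly incomplete leftmost and rightmost ones.
        heat = sum(h for s, t_e, h in tokens if t_e >= b and s <= e)
        if heat > best_heat:
            best, best_heat = (b, e), heat
    return best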

Analyzing the data obtained from the experiments to answer the question whether the algorithm’s run time is suitable for interactivity, we have established the following: in 38% of cases the algorithm ran for less than 5 s, in 78% of cases for less than 30 s, and in 90% of cases for less than 2 min. These run times are fairly adequate for interactive mode.

We have obtained the following data on the output volumes of our algorithm: 84% of outputs contained less than 100 elements, 5% of outputs from 100 to 200 elements, 5.6% from 200 to 600 elements, and 5.4% from 600 to 1000 elements. Thus, the majority of near duplicate groups in software documentation are relatively small (containing up to 100 elements), which follows from Theorem 1 and our experimental results.

CONCLUSIONS

In this study we have presented an interactive near duplicate search process for software documentation. This process solves the problem of meaningful extraction of near duplicates by involving the user, who can use an automatically generated heat map of exact duplicates to detect the most probable occurrences of near duplicates. We have created a pattern-based near duplicate search algorithm and provided optimizations for it. We have proven the completeness of the algorithm, meaning that all near duplicates contained in the document are present in the algorithm’s output. More precisely, duplicates located in the document significantly intersect with particular elements of the output, which is why the user can identify them with ease. Our process allows the user to manually edit their bounds and to include them in the output in full. We present complexity estimates for our algorithm as well as experimental results. These results suggest that duplicate groups in software documentation generally do not exceed 100 elements, and that the algorithm itself performs adequately for practical use.

In the future, we plan to study different types of software documentation in detail using our algorithm and experiment model (focusing on API documentation first). We also intend to thoroughly examine the behavior of our algorithm with varying input parameters (pattern length and similarity measure). Finally, we plan to switch to automatic methods of detecting meaningful duplicates via machine learning. A detailed analysis of different types of near duplicates in different software documentation types is required as well. Other fruitful areas for future work are integration of documentation reuse (in the context of requirement development) with automatic test development [40, 41], and visualization of duplicate structure using diagrams [42].