1 INTRODUCTION

Software documentation quality issues have been studied since the 1970’s [1] and continue to be addressed today [2]. Moreover, documentation, like software itself, is becoming increasingly complex and requires ever more resources for its development and maintenance.

Copy/paste is commonly used in creating and modifying documents: a fragment of text is copied multiple times, and then edited and expanded to suit the subject. The use of this technique is justified by the fact that many software features described in documentation reuse functionality – this is true for requirements, user interface elements, tests, source code, etc. However, without the aid of specialized tools, repeatedly copied fragments create additional difficulties during maintenance because they require extensive synchronization of changes in the corresponding software features. Software reuse is a considerably more advanced research area than software documentation reuse. There is a multitude of studies on software reuse, some of which have been adopted by the industry (see surveys [3–5]). However, the problem of software documentation reuse largely remains a subject of academic research only [6–10]. We should also mention the problem of documentation unification – if there is a large volume of similar information, it is only reasonable to present it in a consistent manner. Thus, duplicate detection and duplicate examination are important for setting up documentation reuse and unification.

Duplicates in software documentation have been extensively studied during the last decade [6, 11–17]. At the same time, there are no specialized tools for duplicate detection. Generally, various text search tools are used for this purpose. However, these tools cannot be employed to detect near duplicates, i.e. text fragments with a substantial common part and certain variations. For this reason, we have created Duplicate Finder [18, 19]. In [16], we presented a near duplicate search algorithm which considers near duplicates as a combination of exact duplicates. However, the output quality of this algorithm was low because it did not assess the meaningfulness of the found duplicates.

In this paper we present an approach for interactive detection of near duplicates. We involve the user in order to ensure the meaningfulness of the search process. In short, this process is organized as follows. At first, we automatically create a map of exact duplicates of the document using Clone Miner [20].

The next step relies on the following assumption: an accumulation of exact duplicates in a certain place of the document points to a possible emergence of a near duplicate. At this step, the user moves to the most duplicate-populated document section using this map as a clue. After that, they select the text fragment that contains the most frequently used exact duplicates. Then, the user transforms this fragment into a full description of a certain software feature by extending its bounds. This way, the user ensures the meaningfulness of the fragment. Next, this description (textual string) is used as a pattern for further search. The user can also edit the search results by filtering out false positives and ensuring the meaningfulness of found fragments by expanding or narrowing their bounds in the document.

This article is organized as follows. In Section 2 we present an overview of existing studies that are similar to the current work, and in Section 3 we describe the approaches, methods, and technologies we have used in our study. Section 4 contains the description of the interactive near duplicate search technique. In Section 5 we propose a new formal definition of a near duplicate, and in Section 6 we present the pattern-based near duplicate search algorithm that forms the basis of the proposed technique. In addition, we formulate the criterion of completeness for the algorithm and prove that the proposed algorithm is complete. In Section 7 we provide complexity estimates for the algorithm, and in Section 8 we demonstrate the results of an experimental evaluation.

2 RELATED WORK

Duplicates in software documentation have been extensively studied during the last decade. Horie et al. [6] consider duplicates in API documentation of Java projects, extending JavaDoc with reuse tools. This approach is expanded with consideration of near duplicates by Nosál’ and Porubän in [12]. Similarly to the approaches presented in [8–10], the authors of the latter paper employ parametrization for defining variative parts of duplicates. However, their study does not consider the task of near duplicate search itself, and their definition of a near duplicate is informal.

In further research [13], Nosál’ and Porubän examine exact duplicates in embedded documentation of several open-source projects, but do not consider near duplicates.

Wingkvist et al. [14] use duplicates to evaluate the quality of documentation, with no consideration of near duplicates.

The paper [11] by Juergens et al. examines duplicates in requirement specifications: the authors analyzed 28 industrial documents, manually filtering and classifying the found duplicates. The meaning of these duplicates was discussed (with emphasis on duplicates that corresponded to code clones). They have also studied the influence of redundancy on the speed of reading and understanding texts. This work does not consider other types of software documentation or near duplicates, although the authors do mention their existence.

Ouzmazis et al. [7] analyze API documentation of several well-known open source projects, classify detected duplicates, and consider the problem of documentation reuse. They do not consider near duplicates; however, they note that such duplicates occur quite often and are important in practice.

Rago et al. [21] present near duplicate search in textual use case descriptions with the use of natural language processing methods. However, they consider a highly specific type of requirement specifications, which is rarely used in practice. Moreover, it is unclear how to apply this method to other types of documentation.

Concluding our overview, we should note that most existing approaches, except [21], employ token-based code clone analysis tools, which seriously complicates near duplicate detection. However, some authors acknowledge the existence and importance of near duplicates in documentation redundancy analysis and documentation reuse [7, 11, 12].

The approach presented in this work is largely based on pattern matching. This problem is well studied and has been solved in a multitude of ways for different contexts. Let us provide a short overview of this problem in the context of text search.

The algorithm proposed by Ukkonen [22] enables efficient matching of approximate occurrences of a pattern in text, but it requires costly preprocessing of the pattern.

Broder [23] describes a method for matching approximate occurrences of a pattern in text using information retrieval methods; we should note that this approach also requires expensive preprocessing of the input data (both document and pattern).

Algorithms presented in [22, 2426] are efficient but quite sophisticated, which complicates their use and modification, as well as proving their formal properties.

Ukkonen’s algorithm [22] is suited for working with an immutable pattern, and Broder’s approach [23] operates on an immutable document; neither of these situations matches our task. Studies by Landau and Vishkin [25] and Myers [26] describe algorithms for detecting text fragments for which the Levenshtein distance [27] does not exceed a pre-defined threshold. We should emphasize the high computational complexity of Levenshtein distance calculation, which, as shown by our experiments, makes this approach unsuitable for duplicate detection. A more detailed review of approximate pattern matching in text can be found in [28]. Using ideas proposed in this research area, we have created our own pattern matching algorithm while adhering to the following requirements:

(i) the algorithm needs to perform near duplicate detection in accordance with our definition of near duplicates;

(ii) we wanted to formally prove a number of properties of this algorithm;

(iii) the run time has to be adequate because the algorithm is run in an interactive mode;

(iv) the algorithm needs to yield as few false positives as possible.

3 BACKGROUND

3.1 Edit Distance

We use edit distance [27] to determine the degree of similarity of two text fragments (strings of text). This distance is essentially the number of string editing operations required to convert one string into another: the fewer operations, the more similar the strings. Definitions of edit distance differ in their admissible operations. In our work, we use the longest common subsequence distance [29, 30], which admits only insertion and deletion of a symbol, due to its suitability for the near duplicate model described below and, consequently, the convenience of further proofs. The authors of [31] prove that this type of edit distance has metric properties. Further on, we will denote the longest common subsequence distance between two strings \({{s}_{1}}\) and \({{s}_{2}}\) as \(d({{s}_{1}},{{s}_{2}})\).

Computing edit distance is a resource-consuming task. The complexity of the algorithm we have selected is estimated [32] as \(\mathcal{O}(|s_1| \cdot |s_2|)\) in the average case. Furthermore, the authors of [33] show that it is impossible to design an algorithm that provides better complexity estimates for the worst case. In this work, we use the difflib library [34], which is included in the Python standard library (its performance-critical parts are implemented in C).
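As an illustration, the longest common subsequence distance can be computed with difflib via the identity \(d(s_1, s_2) = |s_1| + |s_2| - 2\,|\mathrm{LCS}(s_1, s_2)|\). The following is a minimal sketch of our own (the paper does not publish its code); note that SequenceMatcher yields a close approximation of the longest common subsequence rather than a guaranteed optimum.

from difflib import SequenceMatcher

def lcs_distance(s1: str, s2: str) -> int:
    # Longest common subsequence distance: only insertions and deletions
    # are counted, so d(s1, s2) = |s1| + |s2| - 2 * |LCS(s1, s2)|.
    matcher = SequenceMatcher(a=s1, b=s2, autojunk=False)
    common = sum(block.size for block in matcher.get_matching_blocks())
    return len(s1) + len(s2) - 2 * common

# For example, d("alter owner", "alter the owner") = 4:
# four symbols ("the ") have to be inserted.
print(lcs_distance("alter owner", "alter the owner"))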

3.2 Detecting Exact Duplicates with Clone Miner

We employ exact duplicate detection to create a duplicate map, which the user then uses to select a pattern for matching. We have selected Clone Miner [20] for this task, which is a software code clone detection tool. Clone Miner is a token-based tool and does not employ an abstract syntax tree. A token is a stand-alone word (sequence of symbols) in a document, separated from adjacent words by such delimiters as “.”, “,”, “ ”, “(”, “)”, etc. For example, the fragment “FM registers” consists of two tokens. Clone Miner considers text as an ordered collection of tokens and detects duplicate fragments (clones) using algorithms based on suffix trees [35]. We have chosen Clone Miner because it is easily integrated with other tools through a command line interface and supports the Russian language.
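For illustration, a token stream of the kind Clone Miner operates on can be approximated with a few lines of Python (a sketch under our own assumptions about the delimiter set; Clone Miner’s actual tokenizer may differ):

import re

# Split text into tokens, treating punctuation and whitespace as delimiters.
# The delimiter set here is illustrative, not Clone Miner's exact one.
TOKEN_RE = re.compile(r"[^\s.,()\[\]{};:!?\"']+", re.UNICODE)

def tokenize(text: str) -> list[str]:
    return TOKEN_RE.findall(text)

print(tokenize("FM registers"))  # ['FM', 'registers'] – two tokens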

4 THE PROCESS

The general purpose of our process is to ensure meaningfulness of duplicate detection via user interaction. A diagram which describes the workflow of the process is presented in Fig. 1. Let us describe the process in detail.

Fig. 1. Process overview.

Generating a duplicate map. Using Clone Miner [20], all exact duplicate groups in the document are detected. Every token (word) \(t\) is assigned a color from an RGB interval from white to red: \(\text{color}(t) = (h(t)/T_m) \cdot R + (1 - h(t)/T_m) \cdot W\), where \(R = [1,0,0]\), \(W = [1,1,1]\), \(h(t)\) is the cardinality of the largest exact duplicate group that contains \(t\) (further on called the token temperature), and \(T_m\) is the maximum cardinality of an exact duplicate group in the document (the maximum token temperature). The closer a token’s color to red, the “warmer” it is. This metaphorical representation is called a heat map [36], an example of which can be seen in Fig. 2.

Fig. 2. Duplicate map.

The generated heat map provides an overview of the potential near duplicate occurrences. The areas most likely to contain near duplicates are represented by red areas of different hue. Tokens that occur in the reddest areas are repeated the same or roughly the same number of times. Therefore, the probability that these tokens form a meaningful near duplicate is quite high. This way, one may hope to obtain meaningful near duplicates that appear a significant number of times in the document.
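A possible implementation of the coloring described above is sketched below (our own Python illustration; it assumes token temperatures \(h(t)\) have already been obtained from Clone Miner’s exact duplicate groups):

def token_color(h_t: int, t_max: int) -> tuple[float, float, float]:
    # Linear interpolation between white W = (1, 1, 1) and red R = (1, 0, 0):
    # the hotter the token, the redder its color.
    ratio = h_t / t_max if t_max else 0.0
    red, white = (1.0, 0.0, 0.0), (1.0, 1.0, 1.0)
    return tuple(ratio * r + (1.0 - ratio) * w for r, w in zip(red, white))

# A token occurring in an exact duplicate group of maximum cardinality
# gets pure red; a token outside any group stays white.
print(token_color(h_t=10, t_max=10))  # (1.0, 0.0, 0.0)
print(token_color(h_t=0,  t_max=10))  # (1.0, 1.0, 1.0)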

Selecting a search pattern. The user moves to the reddest (the “warmest”) area on the heat map (Fig. 2), zooms in on it, and selects a fragment (pattern) for further search (see Fig. 3). During this process, the user considers not only the color of the selected fragment but also aims to select a fragment that describes a software feature in full. To achieve this, the user can either include a white-colored text fragment in the pattern or leave out a red-colored one. Consider an example. Following the information from Fig. 3, we select the following fragment:

Fig. 3. Selecting a search pattern (PostgreSQL documentation).

“To alter the owner, you must also be a direct or indirect member of the new owning role, and that role must have CREATE privilege on the table’s schema. (These restrictions enforce that altering the owner doesn’t do anything you couldn’t do by dropping and recreating table. However, a superuser can alter ownership of any table anyway.)”

This fragment describes an integral software feature concerning the administration of the PostgreSQL DBMS: to alter the owner of a database table, you must also have specific rights or be an administrator.

Near duplicate search. The user selects a similarity measure for the highlighted fragment, which is a number from \(1/\sqrt 3\) to 1, and launches the pattern matching algorithm.

Forming a near duplicate group. Having received the algorithm’s output, the user modifies it. During this process the user deletes elements (near duplicate occurrences) that resemble the pattern only syntax-wise but not meaning-wise. Furthermore, for each occurrence the user can modify the bounds of the fragment to ensure its meaningfulness.

5 DEFINING A NEAR DUPLICATE GROUP

In this section we generalize the definition of a near duplicate group that we have proposed earlier in [16, 37]. In contrast with the previous definition, here we use a parameter instead of a constant to define the similarity measure, and we allow extension points to be placed at the ends of duplicates. We consider a document \(D\) as a finite sequence of symbols, denoting its length as \({\text{length}}(D)\).

Definition 1. A text fragment is an occurrence of a certain symbol string in document \(D\).

Therefore, for every text fragment \(g\) of document \(D\) there is an integer interval \([b,e]\), where \(b\) is the position of the first symbol of the fragment, and \(e\) is the position of the last. By \(g \in D\), we denote a text fragment \(g\) of document \(D\). Next, let \([g]\) be the function that maps a text fragment \(g\) to its interval, and let \({\text{str}}(g)\) be the function that maps a text fragment \(g\) to its textual content. By \({\text{b(}}g)\) and \({\text{e}}(g)\) we denote the positions of the beginning and the end of \(g\). Next, \({\text{|}}g{\text{|}}\) is a function that takes a text fragment \(g\) and returns its length as \({\text{|}}g{\text{|}} = 1 + {\text{e}}(g) - {\text{b(}}g)\). Finally, we introduce a two-place predicate \({\text{Before}}({{g}_{1}},{{g}_{2}})\), which is true if and only if \({\text{e}}({{g}_{1}}) < {\text{b}}({{g}_{2}})\).

Definition 2 (Near duplicate group). Consider a collection of text fragments \({{g}_{1}}, \ldots ,{{g}_{M}}\) of document \(D\). We will call this collection a near duplicate group with the similarity measure \(k \in (1{\text{/}}\sqrt 3 ,1]\) (or simply a near duplicate group) if the following conditions are satisfied.

(1) \(\forall i \in \{ 1, \ldots ,M - 1\} \) holds \({\text{Before}}({{g}_{i}},{{g}_{{i + 1}}})\);

(2) There exists an ordered collection of strings \((I_1, \ldots, I_N)\) such that there is an occurrence of this collection in every text fragment, i.e. \(\forall j \in \{1, \ldots, M\}\) \(\forall i \in \{1, \ldots, N\}\): \(I_i \subset \text{str}(g_j)\), and \(\forall i \in \{1, \ldots, N-1\}\) \(\text{Before}(I_i^j, I_{i+1}^j)\) holds, where \(I_i^j\) is the occurrence of \(I_i\) in \(g_j\), and the following condition is satisfied:

$$\forall j \in \{ 1, \ldots ,M\} \frac{{\sum\limits_{i = 1}^N {{\text{|}}{{I}_{i}}{\text{|}}} }}{{\left| {{{g}_{j}}} \right|}} \geqslant k.$$

We define the archetype of a given group as the collection of strings \((I_1, \ldots, I_N)\). It is easy to show that the definition proposed above generalizes the definition given in [16, 37]. If \(G\) is a near duplicate group, then by \(|G|\) we denote the number of elements of this group.
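For illustration (the numbers here are our own, not taken from the paper’s corpus): let the archetype consist of strings with total length \(\sum_{i=1}^{N} |I_i| = 80\), and let the group contain two fragments with \(|g_1| = 100\) and \(|g_2| = 95\). Then the ratios in the condition above are \(80/100 = 0.80\) and \(80/95 \approx 0.84\), so \(\{g_1, g_2\}\) forms a near duplicate group for any similarity measure \(k \leqslant 0.80\).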

Definition 3. Consider a text fragment \(p\) of document \(D\) (\(p \in D\)) and \(g \in D\). We say that \(g\) is a near duplicate of \(p\) with similarity \(k\) if \(g\) and \(p\) form a near duplicate group with similarity \(k\) according to Definition 2.

6 PATTERN BASED NEAR DUPLICATE SEARCH ALGORITHM

6.1 Algorithm Description

The algorithm consists of three phases. At phase 1 (scanning), document \(D\) is scanned by a sliding window \(w\) of size \(L_w = \frac{|p|}{k}\) with a one-symbol step. The text fragment that corresponds to the current window position is compared to pattern \(p\) using edit distance, and if they are close, i.e. \(d(p, w) \leqslant k_{\text{di}}\), then this fragment is saved in the set \(W_1\). The threshold value \(k_{\text{di}}\) is defined as follows:

$${{k}_{{{\text{di}}}}} = \left| p \right|\left( {\frac{1}{k} + 1} \right)(1 - {{k}^{2}}).$$
(1)

This choice will be explained below.
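For instance (an illustrative calculation with assumed values): for \(k = 0.8\) and a pattern of \(|p| = 300\) symbols, formula (1) yields \(k_{\text{di}} = 300 \cdot (1/0.8 + 1) \cdot (1 - 0.64) = 300 \cdot 2.25 \cdot 0.36 = 243\), so window positions whose distance to the pattern exceeds 243 are discarded during scanning.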

At phase 2 (“shrinking”), we search for the largest text fragment that is closest to pattern \(p\) within every element of \(W_1\). Essentially, during this phase the lengths of the elements of \(W_1\) decrease, i.e. text fragments are “shrunk.” This is reasonable since the window (and, consequently, every element of \(W_1\)) has the maximum possible size of a near duplicate of \(p\) (see Lemma 1). During “shrinking,” for every \(w_2 \in W_1\) all of its internal fragments are iterated over, starting with fragments of length \(|p| \cdot k\) and up to fragments of length \(\frac{|p|}{k}\). The one that is closest to the pattern in terms of edit distance is selected; if there are several such fragments, the longest one is taken. This phase results in the set \(W_2\).

Algorithm 1: Pattern-based near duplicate search algorithm

Input data: \(D\) – document, \(p\) – pattern, \(k\) – similarity measure
Result: \(R\)

// Phase 1 (scanning)
1:  \(W_1 \leftarrow \varnothing\)
2:  for all \(w_1\): \(w_1 \in D \wedge |w_1| = L_w\) do
3:      if \(d(w_1, p) \leqslant k_{di}\) then
4:          add \(w_1\) to \(W_1\)
// Phase 2 (“shrinking”)
5:  \(W_2 \leftarrow \varnothing\)
6:  for \(w \in W_1\) do
7:      \(w_2' \leftarrow w\)
8:      for \(l \in I\) do
9:          for all \(w_2\): \(w_2 \subseteq w \wedge |w_2| = l\) do
10:             if \(\text{Compare}(w_2, w_2', p)\) then
11:                 \(w_2' \leftarrow w_2\)
12:     add \(w_2'\) to \(W_2\)
// Phase 3 (filtering)
13: \(W_3 \leftarrow \text{Unique}(W_2)\)
14: for \(w_3 \in W_3\) do
15:     if \(\exists w_3' \in W_3: w_3 \subset w_3'\) then
16:         remove \(w_3\) from \(W_3\)
17: \(R \leftarrow W_3\)

At phase 3 (filtering), duplicate elements of \(W_2\) are eliminated. They emerge at the previous phase because \(W_1\) can contain text fragments that differ only by a window shift of several symbols. Furthermore, elements that are fully contained in other elements of \(W_2\) are filtered out. This phase results in the set \(W_3\), which is the output of the algorithm, i.e. the set \(R\).

Let us describe the auxiliary functions used in Algorithm 1. The \({\text{Compare}}\) function is used during phase 2 to identify which of two text fragments is closer to the pattern \(p\) in terms of edit distance. If the distance from both fragments to the pattern is the same, the longer fragment is selected:

$$\text{Compare}(w_1, w_2, p) = \begin{cases} \text{true}, & d(w_1, p) < d(w_2, p) \\ \text{false}, & d(w_1, p) > d(w_2, p) \\ |w_1| > |w_2|, & d(w_1, p) = d(w_2, p) \end{cases}$$

The Unique function receives a collection of text fragments, iterates over it and discards duplicate fragments.
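For readers who prefer code to pseudocode, a straightforward (unoptimized) Python sketch of Algorithm 1 is given below. It is our own illustration: the function and variable names are assumptions, Optimizations 1–5 are omitted, rounding of the window size is our choice, fragments are represented by their \([b, e]\) intervals, and the distance is computed with the difflib-based function from the sketch in Section 3.1.

from difflib import SequenceMatcher

def lcs_distance(s1: str, s2: str) -> int:
    m = SequenceMatcher(a=s1, b=s2, autojunk=False)
    return len(s1) + len(s2) - 2 * sum(b.size for b in m.get_matching_blocks())

def find_near_duplicates(document: str, pattern: str, k: float) -> list[tuple[int, int]]:
    """Unoptimized sketch of Algorithm 1; fragments are (b, e) index pairs."""
    L_w = round(len(pattern) / k)                    # window size |p| / k
    k_di = len(pattern) * (1 / k + 1) * (1 - k * k)  # threshold (1)

    # Phase 1 (scanning): slide a window of size L_w with a one-symbol step.
    W1 = [(b, b + L_w - 1)
          for b in range(len(document) - L_w + 1)
          if lcs_distance(document[b:b + L_w], pattern) <= k_di]

    # Phase 2 ("shrinking"): inside every window keep the internal fragment
    # closest to the pattern; ties are broken in favor of the longer fragment.
    W2 = []
    for b, e in W1:
        best, best_d = (b, e), lcs_distance(document[b:e + 1], pattern)
        for l in range(round(len(pattern) * k), e - b + 2):
            for bb in range(b, e - l + 2):
                d = lcs_distance(document[bb:bb + l], pattern)
                longer = l > best[1] - best[0] + 1
                if d < best_d or (d == best_d and longer):
                    best, best_d = (bb, bb + l - 1), d
        W2.append(best)

    # Phase 3 (filtering): drop exact duplicates and fragments that are
    # contained in other fragments of the result.
    W3 = set(W2)
    return sorted(f for f in W3
                  if not any(g != f and g[0] <= f[0] <= f[1] <= g[1] for g in W3))

On real documents this naive version is prohibitively slow, which is exactly what Optimizations 1–5 in Section 6.3 address.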

6.2 Algorithm Completeness

The criterion of completeness for our pattern-based near duplicate search algorithm is defined as follows. The algorithm is complete if for arbitrary \(D\), \(p \in D\), algorithm output \(R\), and for any near duplicate group \(G\) of fragment \(p\) with similarity \(k\) (see Definition 2), the following condition holds true:

$$\forall g \in G\;\;\exists w \in R{\text{:}}\,\left| {g \cap w} \right| \geqslant {{O}_{{\min}}}(k),$$
(2)

where \(O_{\min}(k) = \frac{|p|}{2}\left(3k - \frac{1}{k}\right)\). This criterion can be explained as follows: for any fragment of document \(D\) that is a near duplicate of pattern \(p\), the set \(R\) contains a text fragment that significantly intersects with this near duplicate, allowing the user to easily recognize the duplicate in the output. The size of the intersection is bounded from below by the function \(O_{\min}(k)\). We have \(O_{\min}\left(\frac{1}{\sqrt 3}\right) = 0\), and \(O_{\min}(k) > 0\) for larger values of \(k\). This holds since the function increases with increasing \(k\): its derivative is \(\frac{|p|}{2}\left(3 + \frac{1}{k^2}\right)\), which is obviously positive for all \(\frac{1}{\sqrt 3} < k \leqslant 1\). In practice, the best results are achieved for \(k \geqslant 0.77\): for these values \(O_{\min}(k) > \frac{|p|}{2}\), i.e. all elements of \(R\) intersect all near duplicates by at least half of the pattern’s length. Note that the lower estimate \(O_{\min}(k)\) is pessimistic: the experimental results demonstrate a larger overlap between the output and the near duplicates contained in the document. Let us move on to the completeness of the proposed algorithm, proving several auxiliary propositions first.

Lemma 1. Let \(G\) be a near duplicate group of fragment \(p\) with similarity \(k\). Then \(\forall g_1, g_2 \in G\): \(k \leqslant \frac{|g_1|}{|g_2|} \leqslant \frac{1}{k}\) holds true.

Proof. Suppose \(A\) is the archetype of group \(G\) and \(|A|\) is the total length of its strings. Then \(k|g_1| \leqslant |A|\) and \(k|g_2| \leqslant |A|\). Because \(A \subset \text{str}(g_1)\) and \(A \subset \text{str}(g_2)\), we have \(k|g_1| \leqslant |A| \leqslant |g_1|\) and \(k|g_2| \leqslant |A| \leqslant |g_2|\). Therefore, \(k|g_1| \leqslant |g_2|\) and \(k|g_2| \leqslant |g_1|\). Dividing the first inequality by \(k|g_2|\) and the second by \(|g_2|\), we get the required result.

Lemma 2. Let \(G\) be a near duplicate group of fragment \(p\) with similarity \(k\). Then \(\forall g \in G\) the following holds true: \(d(g, p) \leqslant (1 - k^2)|p|\).

Proof. Because \(p\) and \(g\) belong to the same near duplicate group, they have the same archetype and can be presented in the following way:

$$\begin{gathered} p = v_{0}^{p}{{I}_{1}}v_{1}^{p}{{I}_{2}} \ldots v_{{N - 1}}^{p}{{I}_{N}}v_{N}^{p}, \\ g = v_{0}^{g}{{I}_{1}}v_{1}^{g}{{I}_{2}} \ldots v_{{N - 1}}^{g}{{I}_{N}}v_{N}^{g}, \\ \end{gathered} $$

where \({{I}_{1}},{{I}_{2}}, \ldots \mathop {,I}\nolimits_N \) is the archetype of group \(G\), \(v_{0}^{p},v_{1}^{p}, \ldots ,v_{N}^{p}\) is the variative part of \(p\), and \(v_{0}^{g},v_{1}^{g}, \ldots ,v_{N}^{g}\) is the variative part of g.

Let us introduce the following notations: \(v^p = v_0^p v_1^p \ldots v_N^p\), \(v^g = v_0^g v_1^g \ldots v_N^g\), \(A = I_1 I_2 \ldots I_N\). Then, according to the condition of Definition 2, we have \(|A|/|p| \geqslant k\), hence \(|p| - |v^p| \geqslant |p|k\), hence \(|p| - |p|k \geqslant |v^p|\), i.e. \(|p|(1 - k) \geqslant |v^p|\), and, likewise, \(|g|(1 - k) \geqslant |v^g|\). Moreover, \(g\) can be obtained from \(p\) by substituting \(v_i^g\) for \(v_i^p\), i.e. \(d(g, p) \leqslant |v^g| + |v^p| \leqslant (1 - k)(|p| + |g|)\). According to Lemma 1 we have \(|g| \leqslant k|p|\). Then \(d(g, p) \leqslant (1 - k)(1 + k)|p| = (1 - k^2)|p|\).

Lemma 3. For any \(p \in D\), \(k \in (1/\sqrt 3, 1]\), and near duplicate group \(G\) of fragment \(p\) with similarity \(k\) (Definition 3), the criterion of completeness (2) is satisfied with respect to the results of phase 1.

Proof. Let \(g \in G\) and let \(fr\) be a text fragment of length \(\frac{|p|}{k}\) that contains \(g\), i.e. \(g \subseteq fr\). As mentioned above, the triangle inequality is satisfied for the longest common subsequence distance: \(d(fr, p) \leqslant d(fr, g) + d(g, p)\). According to Lemma 2, \(d(g, p) \leqslant |p|(1 - k^2)\). Because we can obtain \(g\) from \(fr\) by removing all symbols that belong to \(fr \backslash g\), \(d(fr, g) \leqslant |fr| - |g|\) holds true. But because \(|fr| = \frac{|p|}{k}\) and, according to Lemma 1, \(|g| \geqslant k|p|\), the following also holds true: \(|fr| - |g| \leqslant \frac{1}{k}|p| - k|p|\). Therefore, \(d(fr, p) \leqslant |p|\left(\frac{1}{k} - k + 1 - k^2\right) = |p|\left(1 + \frac{1}{k}\right)(1 - k^2)\). It is obvious that during the scanning at phase 1 there will be a state in which the window coincides with \(fr\). Then, according to (1), \(fr \in W_1\). Therefore, the following holds true:

$$\forall g \in G:\left( {\left| {fr} \right| = \frac{{\left| p \right|}}{k},g \subseteq fr} \right) \Rightarrow fr \in {{W}_{1}}.$$

Since for any near duplicate there is an element of \(W_1\) that not only intersects with this duplicate but contains it completely, criterion (2) is satisfied for \(W_1\) if we consider it as the set \(R\).

Lemma 4. For any \(p \in D\), \(k \in (1/\sqrt 3, 1]\), and near duplicate group \(G\) of fragment \(p\) with similarity \(k\) (Definition 3), the criterion of completeness (2) is satisfied with respect to the output of phase 2.

Proof. We omit a formal proof due to its large size. The main idea here is considering the worst case where during “shrinking,” the length of \({{w}_{1}} \in {{W}_{1}}\) elements is decreased to k|p|. Considering corner cases (an element positioned in the center or right at the ends of \({{w}_{1}}\)) allows us to confirm that the lemma holds true.

Lemma 5. For any \(p \in D\), \(k \in (1/\sqrt 3, 1]\), and near duplicate group \(G\) of fragment \(p\) with similarity \(k\) (Definition 3), the criterion of completeness (2) is satisfied with respect to the output of phase 3.

Proof. Phase 3 consists of element deletion from \({{W}_{2}}\) only. The intervals of the deleted elements are contained in intervals of other elements. It is obvious that if \({{W}_{2}}\) satisfied the criterion, then \({{W}_{3}}\) will satisfy it as well.

Theorem 1. The criterion of completeness is satisfied for any \(p \in D\), \(k \in (1/\sqrt 3, 1]\), the corresponding algorithm output \(R\), and any near duplicate group \(G\) of fragment \(p\) with similarity \(k\).

Proof. The output of phases 1–3 was proven to satisfy criterion (2) in Lemmas 3–5.

6.3 Optimizing the Algorithm

The proposed algorithm turned out to be inadequate performance-wise: its run time exceeded one hour when searching for patterns larger than 100 symbols in documents of about 2 MB in size. Furthermore, the algorithm produced many false positives: its output contained the same text fragments insignificantly shifted relative to each other. As a result, a number of optimizations have been introduced.

Optimization 1 is applied during phase 1 (scanning). It reduces the number of calculations of \(d\), significantly improving the run time of the algorithm. It is based on the well-known Boyer–Moore algorithm for matching a pattern in a string [38]: during the scan, a check is performed to see by how many symbols the window can be shifted without skipping the required result. Specifically, at each step of the scan we check whether \(d(w, p) > k_{\text{di}} + 1\) holds (\(w\) is the current window position). If it is true, then we slide the window by \((d(w, p) - k_{\text{di}})/2\) symbols to the right. Otherwise, we slide it by one symbol.
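A sketch of how this skip rule changes the phase 1 loop is shown below (our own illustration continuing the naive sketch above; d stands for the lcs_distance function):

def scan_with_skips(document: str, pattern: str, L_w: int, k_di: float, d):
    # Phase 1 with Optimization 1: if the current window is far from the
    # pattern, shift it by (d(w, p) - k_di) / 2 symbols instead of one.
    W1, b = [], 0
    while b + L_w <= len(document):
        dist = d(document[b:b + L_w], pattern)
        if dist <= k_di:
            W1.append((b, b + L_w - 1))
            b += 1
        elif dist > k_di + 1:
            b += max(1, int((dist - k_di) / 2))
        else:
            b += 1
        # Rationale: shifting the window by one symbol changes the distance
        # by at most 2, so positions inside the skip cannot fall below k_di.
    return W1

The skip in Optimization 2 follows the same idea, only with \(d_{\min}\) in place of \(k_{\text{di}}\).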

Optimization 2 is applied during phase 2 (“shrinking”). It also reduces the number of computations of \(d\), using an approach similar to the previous optimization. During the “shrinking” of a text fragment \(w_1\), the window scans the fragment symbol by symbol. At each step \(d(p, w_2')\) is computed and, if necessary, its minimum value \(d_{\min}\) is updated. If \(d(p, w_2') > d_{\min} + 1\) holds true for the current window position \(w_2'\), we slide the window to the right by \((d(p, w_2') - d_{\min})/2\) symbols. Otherwise, we slide it by one symbol. The \(d_{\min}\) value is updated at the beginning of each iteration corresponding to the next value of the sliding window width.

Optimization 3 is applied during phase 3 (filtering). It reduces the cardinality of \(W_3\) as follows: the set is partitioned into maximal subsets that are transitively closed under intersection. Then, for every such subset, the fragment \(w_3\) with the minimum value of \(d(w_3, p)\) is selected (or, if there are several such fragments, the longest one), and all remaining elements of the subset are deleted.
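The grouping step of Optimization 3 can be sketched as follows (an illustration under our own assumptions; intervals sorted by start position form a transitively closed overlap group whenever they chain together):

def filter_overlapping(W3, pattern, document, d):
    # Optimization 3: partition W3 into maximal groups of (transitively)
    # overlapping fragments and keep one representative per group --
    # the fragment closest to the pattern, ties broken by length.
    groups, current, cur_end = [], [], -1
    for frag in sorted(W3):
        if current and frag[0] <= cur_end:   # intersects the running group
            current.append(frag)
            cur_end = max(cur_end, frag[1])
        else:
            if current:
                groups.append(current)
            current, cur_end = [frag], frag[1]
    if current:
        groups.append(current)

    def rank(f):  # smaller distance first, then longer fragment
        return (d(document[f[0]:f[1] + 1], pattern), -(f[1] - f[0] + 1))

    return [min(group, key=rank) for group in groups]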

Optimization 4 is applied during phase 3; it extends all text fragments of \(W_3\) to complete words. The bounds of a text fragment can ignore word boundaries, i.e. incomplete words can be included in text fragments. To address this, text fragments are expanded so that these words are included in full. This helps to decrease the number of false positives in the algorithm’s output.

Optimization 5 is applied during phases 1 and 2. Its purpose is to reuse computed values of \(d\) for identical strings and to parallelize the “shrinking” of the elements of \(W_1\).

Let us show how these optimizations affect the algorithm’s completeness.

Theorem 2.Optimizations 1, 2, 4, 5 preserve the completeness property.

Proof. Consider two strings obtained by concatenation: \(s_1 = ab\) and \(s_2 = bc\), where \(|a| = |c|\) and \(d(s_1, s_2) = d\). We can easily show that \(|a| + |c| \geqslant d\). Using this fact, it is easy to prove the completeness of optimization 1; the completeness of optimization 2 is proven in the same way. The completeness of optimization 4 cannot be doubted because it only extends the elements of the output. Finally, optimization 5 is complete because it only concerns the implementation of the algorithm.

Note 1. Situations where optimization 3 does not satisfy the criterion of completeness are possible, but our experiments show that their number is insignificant in practice.

7 ALGORITHM COMPLEXITY

Document length \(\left| D \right|\), pattern length \(\left| p \right|\), the k value, and the cardinality of the near duplicate group of the pattern \(\left| {{{G}_{p}}} \right|\) are all significant parameters that influence the run time of the algorithm. Let us estimate the algorithm’s complexity depending on these parameters.

The average complexity of calculating \(d\) (i.e. edit distance) is \(\mathcal{O}(|p|^2)\) [32]. Consequently, the average complexity of phase 1 is proportional to \(|p|^2\) and \(|D|\). During phase 2 all internal fragments of each \(w_1 \in W_1\) are iterated over, and it is easy to show that their number is proportional to \(|p|\). Furthermore, the cardinality of the set \(W_1\) is proportional to \(|G_p|\) and \(k_{\text{di}}\). Hence, the complexity of phase 2 is \(\mathcal{O}(|G_p|)\) in terms of \(|G_p|\) and \(\mathcal{O}(|p|^4)\) in terms of \(|p|\). The complexity of phase 3 operations is \(\mathcal{O}(|W_2| \cdot \log|W_2|)\), but because \(|W_2| = |W_1|\), the complexity of phase 3 is \(\mathcal{O}(|G_p| \cdot \log|G_p|)\) in terms of \(|G_p|\) and \(\mathcal{O}(|p| \cdot \log|p|)\) in terms of \(|p|\). Optimizations 1 and 2 on average lead to “skips” during iteration whose size is proportional to \(k_{\text{di}}\) (and hence to \(|p|\)), making the complexity of phases 1 and 2 \(\mathcal{O}(|p|)\) and \(\mathcal{O}(|p|^3)\), respectively. Therefore, with \(k = \text{const}\) the algorithm’s run time can be estimated as \(\mathcal{O}(|D|)\) in the document length, \(\mathcal{O}(|p|^3)\) in the pattern length, and \(\mathcal{O}(|G_p| \cdot \log|G_p|)\) in the group cardinality.

Theorem 3. The complexity of the algorithm with fixed D and p is estimated as \(\mathcal{O}(1{\text{/}}{{k}^{4}})\) on average.

We omit the proof due to its large volume.

8 EVALUATION

Theoretical complexity estimates are not sufficient for determining the real run time of the proposed algorithm. These estimates were derived for each significant parameter of the algorithm independently in order to simplify the proofs, while the real complexity can depend on their combinations. Another argument for the necessity of experimental evaluation is the fact that theoretical estimates do not indicate the value ranges these parameters take in practice. Finally, other properties of the algorithm need to be evaluated as well.

We have conducted our experiments to answer the following questions:

(i) what is the run time of the pattern matching algorithm on real data;

(ii) how large the algorithm’s outputs are on real data.

The first question is important because the algorithm is used in interactive mode, and therefore its run time should not exceed several minutes. Considering output volume, we have proven that our algorithm’s output contains all existing near duplicates of a certain pattern. It is, however, unclear how large the real outputs of the algorithm are: outputs that contain over 100 elements become more or less unfeasible for human analysis. In turn, output volume is affected by the number of false positive matches and the number of near duplicates in the document.

We have experimented on 19 industrial documents both in Russian and English (described in [16]). The experiments were conducted on a computer with the following specifications: Intel Core i7 2600, 3.4 GHz, 16 GB RAM. The documents were converted into “flat text” (UTF-8 encoding) with Pandoc [39]. After the conversion, the size of the documents ranged from 0.04 MB to 2.5 MB (0.75 MB on average). We are inclined to think that these numbers are realistic for \({\text{|}}D{\text{|}}\). However, we should note that it is necessary to create a more representative selection of different documentation types in order to obtain more precise estimates.

The experiments were conducted as follows. We have run the algorithm for patterns of length ranging from 50 to 1000 symbols with a 50-symbol step. A 1000-symbol fragment is about 25% of a page of a docx document, i.e. it is a large fragment, and, according to our experiments, duplicates are generally significantly smaller. We have iterated the similarity measure value \(k\) from 0.6 to 1 with a 0.1 step for each selected pattern in each document. We have selected the pattern in the following way. For a fixed pattern length, we followed our technique and selected the “warmest” area of this length in the document, calculating it automatically as the fragment where the expression \(\sum\nolimits_{t \in fr} h(t)\) reaches its maximum value. In this expression \(t\) is a token of fragment \(fr\) and \(h(t)\) is its temperature. The sum is calculated over all tokens of the fragment, including the possibly incomplete leftmost and rightmost tokens.
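The automatic selection of the “warmest” fragment used in the experiments can be expressed as a simple sliding-window maximization (our own sketch; it assumes tokens are given together with their positions and temperatures):

def warmest_fragment(tokens, length):
    # tokens: list of (start, end, temperature) triples, sorted by start.
    # Returns the (start, end) character window of the given length with the
    # maximal total token temperature; a naive O(n_positions * n_tokens) sketch.
    if not tokens:
        return None
    last_pos = tokens[-1][1]
    best, best_heat = (0, length - 1), float("-inf")
    for b in range(0, max(last_pos - length + 2, 1)):
        e = b + length - 1
        # Sum temperatures of all tokens intersecting [b, e], including the
        # possibly incomplete leftmost and rightmost ones.
        heat = sum(h for s, t_e, h in tokens if t_e >= b and s <= e)
        if heat > best_heat:
            best, best_heat = (b, e), heat
    return best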

Analyzing the data obtained from the experiments to answer the question whether the algorithm’s run time is suitable for interactivity, we have established the following: in 38% of cases the algorithm ran for less than 5 s, in 78% of cases for less than 30 s, and in 90% of cases for less than 2 min. These run times are fairly adequate for interactive mode.

We have obtained the following data on the output volumes of our algorithm: 84% of outputs contained less than 100 elements, 5% of outputs from 100 to 200 elements, 5.6% from 200 to 600 elements, and 5.4% from 600 to 1000 elements. Thus, the majority of near duplicate groups in software documentation are relatively small (containing up to 100 elements), which follows from Theorem 1 and our experimental results.

CONCLUSIONS

In this study we have presented an interactive near duplicate search process for software documentation. This process solves the problem of meaningful extraction of near duplicates by involving the user, who can use an automatically generated heat map of exact duplicates to detect the most probable occurrences of near duplicates. We have created a pattern-based near duplicate search algorithm and provided optimizations for it. We have proven the completeness of the algorithm, meaning that all near duplicates contained in the document are present in the algorithm’s output. More precisely, duplicates located in the document significantly intersect with particular elements of the output, which is why the user can identify them with ease. Our process allows the user to manually edit their bounds and to include them in the output in full. We present complexity estimates for our algorithm as well as experimental results. These results suggest that duplicate groups in software documentation generally do not exceed 100 elements, and that the algorithm itself performs adequately for practical use.

In the future, we plan to study different types of software documentation in detail using our algorithm and experiment model (focusing on API documentation first). We also intend to thoroughly examine the behavior of our algorithm with varying input parameters (pattern length and similarity measure). Finally, we plan to switch to automatic methods of detecting meaningful duplicates via machine learning. A detailed analysis of different types of near duplicates in different software documentation types is required as well. Other fruitful areas for future work are integration of documentation reuse (in the context of requirement development) with automatic test development [40, 41], and visualization of duplicate structure using diagrams [42].