
1 Introduction

Software documentation is a significant component of modern software. It is supposed to help software engineers comprehend a given software system and accomplish development and modification tasks more efficiently [1]. There are two types of software documentation: technical documentation (requirement specifications, design documents, etc.) and user documentation (e.g., user guides). Sometimes API documentation is considered separately; it is a special case of technical documentation that describes the application programming interfaces of reusable code libraries [2]. In this paper, we consider technical software documentation only.

It should be noted that technical documentation may have considerable size and a complex structure and, like the software itself, is constantly changed during the development process. The quality of technical documentation is a well-known problem that has not been resolved over the last decades [3]. One of the reasons for the substantial decrease of documentation quality during maintenance is that documents may contain numerous repetitions. If there is no traceability between duplicate text fragments, each fragment has to be modified manually when changes are made. In practice, however, it is hard to keep the documentation up to date because of its huge volume and the lack of time. This leads to the accumulation of mistakes and contradictions in the documentation.

The situation is even more complicated because very often the duplicate information is a “near duplicate”, e.g., in one document the same software features may be described many times with different levels of detail. Also, there are sets of similar objects described in the documentation: functions, interrupts, signals, etc. If objects belong to the same set, their descriptions have a lot in common, but at the same time they differ from each other. This leads to repetitions with variations and makes it difficult to apply the usual text search techniques to find such repetitions. Moreover, it is necessary not only to find duplicate text fragments, but also to manage them consistently.

Systematic reuse techniques attempt to simplify the software maintenance process. Different techniques of software reuse have been proposed [4, 5]. One of these techniques is XVCL [6], which is based on adaptive reuse introduced by Paul Bassett [7]. These ideas were applied to software documentation reuse in the DocLine framework [8]. In this context, refactoring of XML documentation was also explored [9] to simplify the maintenance of existing documentation by extracting reusable text fragments. But the challenge of automatically searching for reusable document fragments remained open.

This paper closes this gap using software clone detection techniques [10, 11]. Our approach is designed to operate with XML documentation in the DocBook [12] and DRL [8] markup languages, as well as with plain text (ASCII/Unicode). DocBook is a widespread XML language for software documentation development in the Linux/Unix community, while DRL is an extension of DocBook that implements the adaptive reuse approach. We used Clone Miner [13] as the clone detection tool, filtering and correcting its output. Based on the clones found by the tool, the approach supports refactoring of documentation, i.e., producing customizable reusable elements and inserting them into the text. We consider not only exact duplicates but also “near duplicate” text fragments, applying the adaptive reuse technique [6, 7]. We implemented the approach as a tool that incorporates visualization and navigation facilities for the detected clone groups and provides seamless invocation of DocLine refactoring operations. The paper includes the results of evaluating the proposed approach: we applied our tool to the DocBook documentation of several open source projects and, in particular, to the Linux Kernel Documentation (LKD) [14].

2 Related Work

Technical documentation development currently makes wide use of XML markup languages. The widely used standards are DocBook [12] and DITA [15], both supporting a modular approach and enabling the development of reusable documentation components. In [8], the adaptive software reuse technique of Bassett and Jarzabek [6, 7] was applied to documentation. But all of these approaches assume that documentation is developed as reusable modules from the very beginning, and they do not offer approaches or tools for searching and extracting repetitions. Meanwhile, documentation maintenance often requires eliminating inconsistencies, because previous corrections were local and made by different people in different manners. Searching for “near duplicate” text fragments and extracting reusable text elements could simplify the maintenance process. Moreover, this may also lead to unifying the descriptions of similar code objects (signals, API functions, handlers, etc.) to facilitate future changes. The adaptive reuse technique of XVCL is helpful in this regard. In [9], refactoring of documentation was suggested to extract adaptive reusable elements, but no tools to search for repeated fragments were available.

A systematic review of the software documentation domain is presented in [16]. Below, we overview some of the studies that provide automatic analysis and transformation of documentation.

Zhong et al. [17] suggested an approach to infer resource specifications from API documentation. The approach addresses the problem that developers tend to ignore information in API documentation: if some part of the code is generated automatically from the technical documentation, this problem disappears. The paper proposes to generate resource specifications from the documentation.

An approach to detecting documentation errors by comparing code samples with the corresponding document fragments is proposed in [18]. The approach is based on comparing the code objects mentioned in the text (data types, procedures, variables, etc.) with those in the samples.

Garousi et al. [1] suggest analyzing the usage and quality of software project documentation during the development and maintenance phases, based on project data and expert opinions collected with a survey-based questionnaire.

Metrics to measure documentation quality are proposed in [19, 20]. The authors also adapt the VizzAnalyzer clone detection tool [21] to provide a measurement of a document’s uniqueness. However, further use of the found clones is only briefly discussed, and their automatic transformation for future reuse is not performed.

To summarize, little attention has been given to searching for repetitions in software technical documentation in order to extract reusable elements. The issue is only touched upon in [19, 20], and no approach applies the idea of adaptive reuse to software technical documentation.

3 Background

3.1 DocBook

DocBook [12] is a collection of standards and tools for technical writing, particularly suited for large and highly structured content. The key difference between DocBook and other structured formats (e.g., LaTeX) is that the style (bold, font size, italics, etc.) is separated from the structured content. This allows one source document to have many presentations, such as HTML, PDF, etc. Unlike other document markup tools, DocBook is not a WYSIWYG (What You See Is What You Get) technology. It provides more flexibility and allows one to create more reliable documents, but it demands that technical writers be more experienced than Microsoft Word users (a discussion of the usage of markup languages by technical writers can be found in [22]). DocBook can be easily extended, and it is possible to use such extensions in practice: one only needs to preprocess the specifications to translate the extended constructs into plain DocBook, after which the standard DocBook utilities can be used to produce target document presentations (e.g., PDF).

3.2 DocLine

DocLine [8] was created for the development and maintenance of complex software documentation; it is based on adaptive reuse [6, 7] to handle duplicate documentation fragments. Adaptive reuse means that reusable text fragments can be configured for each context into which they are inserted.

DocLine provides a new XML markup language called DRL, a model of the documentation development process, and a toolset integrated into the Eclipse IDE. DRL (Documentation Reuse Language) extends DocBook [12], providing two mechanisms of adaptive reuse: customizable information elements and multi-view item catalogs.

Customizable Information Elements. This mechanism can be understood with the help of a simple example. Let us consider a news aggregator that provides a news feed from different sources. A description of the module that refreshes news from RSS and Atom feeds can be the following:

figure a: listing (1)

Meanwhile, the news aggregator can also use Twitter as a news feed, and the description of the corresponding module can be as follows:

figure b: listing (2)

To provide reuse of duplicate text in (1) and (2) using an adaptive reuse technique, the corresponding information element must be specified in DRL:

figure c: listing (3)

In this example, we define an information element (<infelement/> tag) and an extension point inside it (<nest/> tag). When this information element is included in a particular context, the extension point can be removed, replaced, or appended with custom content without modifying the information element itself. The following customization transforms (3) into (2):

figure d: listing (4)

Example (4) shows a reference to the information element defined in (3) (<infelemref/>) and the replacement of the extension point defined in that information element with new content (<replace-nest/>).

Multi-view Item Catalogs. In the documentation of most software products one can find descriptions of typical items of the same kind. To organize adaptive reuse for such cases, a multi-view item catalog is introduced in DRL. The catalog contains a collection of items, each represented by a set of attributes. When a technical writer includes a catalog item into a particular context, s/he must indicate the corresponding representation template and the item identifier. Then the content of the template is inserted into the target context, and all references to attributes are replaced by the corresponding attribute values. A particular case of the catalog is a dictionary, which contains a set of terms without representation templates. Dictionaries are useful for creating glossaries that unify the naming policy in documentation. More details about multi-view item catalogs can be found in [8, 9].
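To make the mechanism concrete, the following sketch imitates it in Python rather than in DRL: the catalog contents, the attribute names, and the representation templates are invented for illustration, and the {attribute} placeholders only stand in for DRL attribute references.

```python
# A minimal illustration of the multi-view item catalog idea (not DRL syntax).
# A catalog maps item identifiers to attribute sets; a representation template
# contains attribute references that are substituted when the item is inserted.

catalog = {  # hypothetical catalog of API functions
    "open_feed": {"name": "open_feed", "arg": "url", "result": "feed handle"},
    "close_feed": {"name": "close_feed", "arg": "feed handle", "result": "status code"},
}

templates = {  # hypothetical representation templates (different "views" of an item)
    "short": "The {name} function takes a {arg} and returns a {result}.",
    "heading": "Function {name}",
}

def insert_item(template_id: str, item_id: str) -> str:
    """Render a catalog item into a target context using the chosen template."""
    return templates[template_id].format(**catalog[item_id])

print(insert_item("short", "open_feed"))
# -> The open_feed function takes a url and returns a feed handle.
```

The point of the design is that the same item can be rendered through different templates (views) in different places, while its attribute values are defined once in the catalog.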

3.3 Refactoring Documentation

Refactoring is the process of changing a software system in such a way that it does not alter the external behavior of the code, yet improves its internal structure [24]. In [9], refactoring was adapted to XML documentation maintenance. In this case, refactoring means changing the internal document specification (XML markup constructs) while preserving the output document presentation (e.g., a PDF file). Based on this idea, a number of refactoring operations were designed for DocLine [9]. The operations can be divided into the following groups:

  1. Operations for extracting common assets and, in particular, for the transition to DRL from plain text or DocBook.

  2. Operations to facilitate core asset tuning (extending their configurability).

  3. Operations to facilitate the use of fine-grained reuse constructions: dictionaries and multi-view item catalogs.

  4. Operations for renaming various structural elements of documentation.

3.4 Software Clone Detection and Clone Miner

Very often software is reused by means of copy/paste. This produces duplicate code (software clones), which may lead to serious maintenance problems. Clone detection methods and tools aim to find different kinds of duplicate code so that refactoring based on reuse techniques can be performed. A systematic review of clone detection methods and tools can be found in [11], while an interesting discussion of code cloning and clone detection is presented in [10].

This area is quite mature; there are a number of ready-to-use tools. We selected the Clone Miner tool [13] because it is a simple command line tool that could easily be integrated into the DocLine framework. Clone Miner is a token-based code clone detector. It converts the input source code into a string of lexical tokens and then applies suffix array based string matching algorithms to find the repeated parts of this string as clone groups.

The tool allows varying the minimal length of clones to be searched for, measured in the number of tokens. A token, in the context of text documents, is a single word separated from other words by separators such as ‘.’, ‘(’, ‘)’, etc. For example, the following text fragment consists of 2 tokens: “FM registers”.
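As a rough illustration of this token-based view (and not of Clone Miner’s actual suffix-array implementation), the following sketch tokenizes a text and collects repeated token sequences of a given minimal length; the separator set and the function names are our own simplification.

```python
import re
from collections import defaultdict

def tokenize(text: str) -> list[str]:
    # Split on whitespace and common separators such as '.', '(', ')', ','.
    return [t for t in re.split(r"[\s.(),;:]+", text) if t]

def repeated_sequences(tokens: list[str], min_tokens: int = 2) -> dict[tuple, list[int]]:
    """Naively collect token n-grams of length min_tokens that occur more than once.
    Each group maps the repeated sequence to the token positions where it starts."""
    groups = defaultdict(list)
    for i in range(len(tokens) - min_tokens + 1):
        groups[tuple(tokens[i:i + min_tokens])].append(i)
    return {seq: pos for seq, pos in groups.items() if len(pos) > 1}

text = "FM registers are described below. FM registers control the tuner."
print(repeated_sequences(tokenize(text)))
# -> {('FM', 'registers'): [0, 5]}
```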

Clone Miner was extended for this project to support plain text and Unicode inputs, which made it possible to apply the tool to Russian language documents as well.

4 The Process of Clone Detection and Refactoring

4.1 Overview

The general scheme of the process is shown in Fig. 1. The input of the process is a DRL file, which the user prepares for clone detection. After that, s/he starts document clone detection by launching Clone Miner, which generates the output results. Once the user gets the list of clone groups, s/he can execute automated refactoring for any clone group. During refactoring, all occurrences of the selected clone are replaced by references to the definition of a reusable element.

Fig. 1. Process overview

4.2 Preparation for Clone Detection

DocLine operations are executed on DRL constructs; in particular, the search for clones is applied to information elements. If the user wants to apply document clone detection to plain text or DocBook documents, s/he first has to perform the refactoring operation “Transition to DRL”. As a result, a new information element appears that includes the whole original text. Clone detection is then performed on this information element.

4.3 Clone Detection

We run Clone Miner in the “search in flat text” mode since we actually need flat text search: the repetitions in question may be found across XML structures. Therefore, the found clones may violate the XML markup. Listing (5) shows a text fragment with a found clone emphasized; it includes a start tag but not the matching end tag. The clones, as well as their context in the document, become XML-correct as a result of the refactoring operations.

figure e: listing (5)

4.4 Filtering

We use the detected clones for refactoring, and it is the technical writer who is responsible for choosing the candidates for refactoring based on their semantic meaningfulness. Meanwhile, the number of detected clone groups is so large that they need to be filtered. Our algorithm filters the Clone Miner output in the following steps (a Python sketch of these filters follows the list):

  1. A clone group is rejected if the clone length in the group is less than 5 symbols (e.g., “is a” contains 3 symbols). As a rule, such clones carry no semantics, but a lot of such groups are usually found. Some terms can be lost, especially abbreviations, but this considerably reduces the number of insignificant clone groups. Note that elsewhere in the paper we measure clone length in tokens (i.e., the number of symbol sequences separated by commas, spaces, etc.), but here we measure it in symbols, because the lengths involved are too small.

  2. We eliminate the groups whose clones consist only of XML constructs and contain no output text: it is not our goal to organize reuse of XML markup constructs.

  3. We remove the clone groups consisting of phrases like “that is”, “there is a”, etc.: these clones carry no software semantics (this issue is discussed in Sect. 6). To filter out such clones we have compiled a dictionary of such expressions based on our own experiments, and we check whether the clones of every group belong to this dictionary. In the future, more sophisticated analysis techniques could be used, considering natural-language patterns embedded [23] in the strictly defined DRL markup.
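The sketch below summarizes the three filters. It assumes a simplified representation in which a clone group is identified by its clone text; the symbol threshold, the markup-only test, and the stop-phrase dictionary are illustrative stand-ins for the actual implementation.

```python
import re

# Hypothetical stop-phrase dictionary; the real dictionary was compiled experimentally.
STOP_PHRASES = {"that is", "there is a", "as well as"}
MIN_SYMBOLS = 5  # groups whose clone text is shorter than this are rejected

def is_markup_only(clone_text: str) -> bool:
    """True if the clone consists only of XML constructs and produces no output text."""
    return re.sub(r"<[^>]*>", "", clone_text).strip() == ""

def keep_group(clone_text: str) -> bool:
    text = clone_text.strip()
    if len(text) < MIN_SYMBOLS:          # filter 1: too short to carry semantics
        return False
    if is_markup_only(text):             # filter 2: pure XML markup, nothing to reuse
        return False
    if text.lower() in STOP_PHRASES:     # filter 3: common phrases without software semantics
        return False
    return True

groups = ["is a", "<para></para>", "there is a", "inet daemon can listen on"]
print([g for g in groups if keep_group(g)])
# -> ['inet daemon can listen on']
```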

4.5 Refactoring

After the previous steps we have a set of clone groups, but our aim is to use the clones to extract reusable elements. This can be done using the refactoring process described below. The process uses the refactoring operations suggested in [9], but some additional activities have to be performed. The schema of the refactoring process is presented in Fig. 2.

Fig. 2. Refactoring process

Searching Close Pairs. We have the list of clone groups found by Clone Miner (denoted SetG). To provide adaptive reuse, we search for clone groups from SetG whose clones are located close to each other. For example, the following phrase can be found in the text 5 times with different variations (various port numbers): “inet daemon can listen on ... port and then transfer the connection to appropriate handler”. In this situation we have 2 clone groups with 5 clones in each group: one group includes the clones “inet daemon can listen on”, while the other includes “port and then transfer the connection to appropriate handler”. We want to combine these clone groups into a single information element with one extension point to capture the different port numbers.

To find such clone groups we propose an algorithm. The algorithm works with only two clone groups, because our observations show that this is the most common case (we plan to extend the algorithm to n clone groups in the future). We define the distance between two clones as the number of symbols between them (we do not consider the case of intersecting text fragments). We define the distance between two clone groups \(G_1\) and \(G_2\) under the following constraints:

  1. They have the same number of clones: \(\#G_1 = \#G_2\).

  2. We introduce an ordering on the clones of a group based on their appearance in the document and assign a number to each clone. As a result we obtain a set of clone pairs, where for every pair the first member belongs to \(G_1\) and the second member belongs to \(G_2\). Clones in the same pair must not intersect, i.e., they do not overlap in the text:

     \( \forall k \in (1.. \#G_1) ~ g_1^k \cap g_2^k = \emptyset \),

     where \(g_1^k\) and \(g_2^k\) are members of \(G_1\) and \(G_2\), respectively.

  3. For every pair of clones (one clone from the first group, the other from the second group, both with the same number), the clone from one group occurs before the clone from the other group in the document, consistently for all pairs:

     \((\forall k \in (1.. \#G_1)~ Before(g_1^k, g_2^k)) ~\vee~ (\forall k \in (1.. \#G_1)~ Before(g_2^k, g_1^k))\),

     where \(g_1^k\) and \(g_2^k\) are members of \(G_1\) and \(G_2\), respectively.

The distance between \(G_1\) and \(G_2\) is \(dist(G_1, G_2) = \max (dist(g_1^k, g_2^k))\), where \(dist(g_1^k, g_2^k)\) is the distance between the clones (text fragments) \(g_1^k\) and \(g_2^k\). We use this simple formula because we have just one special requirement for the distance between clone groups: if we choose a clone group, it should be possible to compare the distances from this group to the others in order to select the closest one. But we do not want to consider unrealistic distances, that is, the variation of distances between the clone pairs of the selected groups should not be too large. For example, if the distance between the first clone pair is 1 symbol and the distance between the second pair is 10 000 symbols, then there is no chance that these pairs are semantically connected, and it would not be sensible to create an information element with an extension point. Following our experiments, we have set the maximum variance of distances between clones from two groups to the constant 2000: \(\mathrm {Var}(\{dist(g_1^k, g_2^k) \mid k\in (1.. \#G_1), g_1^k \in G_1, g_2^k \in G_2\}) \le 2000\). If the variance is greater, we do not consider this pair.

The algorithm for searching close pairs considers all clone groups from SetG and finds the closest group for each of them. If the search is successful, a new pair is added to the set PairG. If it is not successful for the selected clone group (e.g., there is no other clone group with the same number of clones), the resulting list contains no pair with this clone group. A sketch of this search is given below.
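The following sketch illustrates this search under the assumption that each clone is given by its (start, end) symbol coordinates in the document and that each group lists its clones in document order; the helper names are ours, and the variance threshold of 2000 corresponds to the constant chosen above.

```python
from statistics import pvariance

Clone = tuple[int, int]  # (start, end) symbol offsets in the document

def gap(a: Clone, b: Clone) -> int:
    """Number of symbols between two non-overlapping clones; -1 if they overlap."""
    if a[1] <= b[0]:
        return b[0] - a[1]
    if b[1] <= a[0]:
        return a[0] - b[1]
    return -1

def group_distance(g1: list[Clone], g2: list[Clone], max_var: int = 2000):
    """Distance between two clone groups, or None if the constraints are violated."""
    if len(g1) != len(g2) or not g1:
        return None                       # constraint 1: same (non-zero) number of clones
    gaps, order = [], None
    for a, b in zip(g1, g2):
        d = gap(a, b)
        if d < 0:
            return None                   # constraint 2: clones in a pair must not overlap
        before = a[1] <= b[0]
        if order is None:
            order = before
        elif order != before:
            return None                   # constraint 3: consistent order in every pair
        gaps.append(d)
    if pvariance(gaps) > max_var:
        return None                       # distances vary too much to be semantically related
    return max(gaps)                      # dist(G1, G2) = max over clone pairs

def closest_group(target: list[Clone], others: list[list[Clone]]):
    """Find the group from 'others' closest to 'target', if any qualifies."""
    scored = [(d, g) for g in others if (d := group_distance(target, g)) is not None]
    return min(scored, key=lambda x: x[0]) if scored else None
```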

Analysis of Pairs & Clone Groups. In this step we combine the clone group pairs and the initial list of clone groups into a list L and present this information to the user, who decides which text fragments should be extracted as information elements/dictionaries (we call the elements of L candidates for refactoring, or simply candidates). The problem is that reusable text elements have to carry some semantics, e.g., to be a typical description of a function or an interrupt. If reuse relies only on syntax and has no semantics, it is of little use. But it is hard to perform such analysis automatically, which is why we provide browsing facilities to help the user make the right decision.

L includes the clone group pairs and the single groups that are not included in any pair: \(L = PairG \cup \{G \mid G \in SetG ~\wedge~ \nexists P \in PairG: G = left(P) \vee G = right(P)\}\).

We order L by the length of its elements, measured in symbols, in descending order. The length of a clone is the number of symbols in it. The length of a clone group is the sum of the lengths of all clones in the group: for every \(G \in L\), \(length(G)= \#G\cdot length(g)\), where \(g \in G\) and \(\#G\) is the number of clones in G. Recall that all clones in a group are duplicate text fragments, which is why they all have equal length. The length of a clone group pair is the sum of the lengths of the clone groups included in the pair: \(length(Pair(G_1, G_2))= length(G_1) + length(G_2)\). The elements at the top of L thus contain the largest amount of text, and these elements are the most preferable for reuse. The user manually selects a group or a pair on which to perform refactoring operations. A sketch of this ordering is given below.
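The ordering can be sketched as follows, assuming a simplified representation in which a group is its clone text plus the number of its occurrences, and a candidate consists of one group (single) or two groups (pair).

```python
# A single group is represented as (clone_text, clone_count);
# a candidate is a list of one group (single) or two groups (pair).

def group_length(group: tuple[str, int]) -> int:
    """length(G) = #G * length(g): all clones in a group are identical text fragments."""
    text, count = group
    return count * len(text)

def candidate_length(candidate: list[tuple[str, int]]) -> int:
    """Length of a single group or a pair: the sum over its constituent groups."""
    return sum(group_length(g) for g in candidate)

candidates = [
    [("inet daemon can listen on", 5), ("port and then transfer the connection", 5)],
    [("FM registers", 12)],
]
# Order candidates by length, in descending order, so the "largest" text comes first.
candidates.sort(key=candidate_length, reverse=True)
```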

Extracting Information Elements & Dictionaries. For each previously selected clone group or clone group pair, the user can apply the following refactoring operations (see Sect. 3.3, group 1): extracting an information element, extracting an information element with variations, or extracting to a dictionary.

Before executing these operations, we check whether the selected clone group intersects with other clone groups that have already been used to extract information elements/dictionaries. Clone Miner allows clone groups to intersect, as it has no information about what will happen to the detected clones later. In our case, however, such intersections lead to mistakes in the refactoring operations. A sketch of this check is given below.
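A sketch of this check, assuming again that clones are represented by their (start, end) symbol coordinates, might look as follows; the representation and the function names are ours.

```python
def overlaps(a: tuple[int, int], b: tuple[int, int]) -> bool:
    """True if two text fragments given as (start, end) offsets share at least one symbol."""
    return a[0] < b[1] and b[0] < a[1]

def intersects_used(group: list[tuple[int, int]],
                    used_groups: list[list[tuple[int, int]]]) -> bool:
    """Check whether the selected clone group intersects any group already used
    for extracting information elements or dictionaries."""
    return any(overlaps(c, u) for c in group
               for used in used_groups for u in used)
```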

If this check succeeds, we transform the selected text fragments to be reused, as well as the remaining context, into correct XML. As mentioned earlier, Clone Miner outputs XML-incorrect results, but DocLine can operate only with correct DocBook/DRL fragments. Generally speaking, our algorithm opens/closes all the necessary tags, both in the clone and in the context from which it is extracted. However, these open/close actions should be “clever”. For example, if we close and reopen the paragraph tag, we would end up with two paragraphs instead of one in the resulting text (i.e., the pure DocBook text produced after the elimination of DRL constructs and, in particular, after the substitution of reusable information elements). This is a direction of future work.

Once the refactoring operation for the selected candidate is successfully completed, the candidate is removed from L, and the document coordinates of the other elements of L are recalculated. After that, the user goes back to the “Analysis of pairs & clone groups” step.

5 The Tool

To support the process presented above, we implemented the Documentation Refactoring Toolkit [25] and integrated it into DocLine/Eclipse. The tool is implemented in Python and can also be invoked as a standalone application (i.e., outside of Eclipse and DocLine). The tool provides navigation over the detected candidates and a text browsing facility to observe clones in the source text. It is possible to extract all clones of the selected groups into reusable elements, i.e., to perform refactoring.

Fig. 3. Documentation refactoring toolkit

The main window is shown in Fig. 3. The tool is launched for a document, and the title of the document is displayed as the title of the window. The rows of the table in the «Refactoring candidates» section correspond to the clone groups or pairs found for that document. In the pop-up menu for a candidate, the user can select a refactoring type: creating either an information element or a dictionary element. If the candidate is a pair, then the variations are highlighted as yellow/green pieces of text in the «Candidate text» column.

The «Source text» section shows clones in the source document. If the selected candidate is a clone group, the user needs to select the number of a clone in the group (see the colored numbers at the end of the first cell of the «Candidate text» column, Fig. 3). If the candidate is a pair (see the second cell of the «Candidate text» column), then the user needs to select a particular pair by clicking on the corresponding variation. In either case, the «Source text» window displays the clone, or the pair of clones, in the source document.

6 Evaluation

We carried out our experiments using hand-made tests and the third-party DocBook documentation of open source industrial projects. The list of projects and the corresponding documentation is presented in Table 1.

Table 1. Documentation used in experiments

Following the GQM approach [29], we selected a set of questions to characterize the assessment performed in our experiments:

  • question 1: quality of documentation clone detection

  • question 2: effectiveness of filtering clones

  • question 3: evaluation of refactoring facilities

Addressing question 1, we experimented with Clone Miner and the DocLine clone search facilities. The first experiment was carried out on hand-made tests for which we knew the exact number and locations of clones. We found that Clone Miner made some mistakes, e.g., it sometimes skipped the last token of a clone. We fixed these errors, and after that our tool correctly found all the clones in the hand-made tests.

To assess question 2, we used the third-party documentation listed in Table 1. As metrics we used the filtering types described in Sect. 4.4. The results are presented in Table 2. It should be noted that filtering decreases the number of candidates by 13.2 % on average.

Table 2. Filtering results

The numbers of refactoring candidates after filtering are presented in Table 3 for two cases: with the minimal clone length set to 1 and to 5 (separated by a slash in the table). In the latter case there are fewer candidates and the situation looks more manageable. However, the smaller clones excluded in this case can be used as dictionary elements or in other important situations. Therefore, we recommend that the technical writer work with the candidates obtained with the minimal clone length equal to 1. To simplify operations with a large number of candidates, our tool supports ordering by length.

Table 3. Number of candidates in case of minimal length of clone is 1 and 5

Let us consider question 3. We assessed it using the metric called amount of reuse, which tracks the percentage of reused text [30]. We compute the metric by dividing the amount of reusable text by the total size of the documentation. We take all refactoring candidates as reusable text and calculate the amount as \(\sum _{C \in \mathrm {(all~candidates)}} length(C) \), where length(C) is the number of symbols in a clone or a clone pair multiplied by the number of clones in the group (see Sect. 4.5). We measure the size of documentation/text fragments in symbols; it would be better to measure it in tokens, but we had some technical problems with that. The average amount of reusable text for all tested documents is between 48 % and 52.9 %. The results show that when refactoring is carried out in an automatic (straightforward) way, the amount of reuse turns out to be quite significant. But it is hard to estimate the real amount of reuse because, as mentioned above, the technical writer performs additional semantic filtering of the candidates for refactoring. To estimate the quality of refactoring more precisely, additional experiments with real project documentation are necessary. A sketch of this computation is given below.
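Under the same simplified candidate representation as above (each group is its clone text plus the number of its occurrences), the metric can be computed as in the following sketch; the representation is our approximation of the actual implementation.

```python
def candidate_text_amount(candidate: list[tuple[str, int]]) -> int:
    """Symbols covered by a candidate: clone length times the number of its occurrences,
    summed over the one or two groups the candidate consists of (see Sect. 4.5)."""
    return sum(len(text) * count for text, count in candidate)

def amount_of_reuse(candidates: list[list[tuple[str, int]]], doc_size_symbols: int) -> float:
    """Percentage of the documentation covered by refactoring candidates."""
    reusable = sum(candidate_text_amount(c) for c in candidates)
    return 100.0 * reusable / doc_size_symbols
```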

7 Conclusions

Our experiments have shown that even after filtering we still have a lot of insignificant clones. Some of them are easy to remove with improved filtering, but others can only be filtered out manually. Improving the precision of the algorithm is a baseline for our future work. Support for adaptive reuse should also be extended, e.g., by providing extraction of information elements with n extension points, where \(n >1\).

During our experiments, it became clear that our tool should be improved to be more convenient for operating with clone groups, e.g., by providing more facilities for constructing information elements.

The proposed approach can be useful in a software product line documentation management environment for extracting reusable document fragments shared by the documentation of different product line members and for organizing a reusable document structure. It simplifies the document maintenance process and, of course, it is meaningful only if the maintenance effort (for a product line member and/or its documentation) is significant. Our approach can also be used in the context of variability management [31] in software product line development.

Supporting semantic reuse would allow integrating our approach with various software traceability techniques [32, 33], mapping document fragments to other software artifacts: code, requirements, model entities, etc. In this case, reuse can improve the quality of this mapping, and semantic-oriented adaptive reuse could increase the granularity of the mapping.

Apart from software engineering, the proposed approach could also be used in areas such as Ontology Engineering [34] or Enterprise Architecture Modeling [35]: models are usually stored in XML format, and irregular repetitions are possible there as well, given that a number of analysts may work with a large volume of information and that much of this information is unstructured (documents and comments attached to models, long names of model entities, etc.).