1 Introduction

Reconstruction of Cross-Cut Shredded Text Documents (RCCSTD) is an important sub-domain of forensic science and combinatorial optimization, and it plays a significant role in identification, civil disputes, criminal investigations, and so on [7]. It is usually viewed as a special case of the square jigsaw puzzle problem and has attracted the attention of many researchers [21, 22, 26]. Schauer et al. [26] adopted this view of the RCCSTD problem as a variation on jigsaw puzzles and defined three kinds of shredded documents: manually torn documents, strip shredded documents, and cross-cut shredded documents (see Fig. 1).

Fig. 1
figure 1

Three kinds of shredded documents

With regard to the reconstruction of manually shredded documents, it is possible to trace more than 50 years of research. In its earliest phase the usual approach was to calculate the degree of adjacency using information about the fragments' shapes. Wolfson et al. [30] converted the problem of reconstructing manually shredded documents into what is known as the Travelling Salesman Problem (TSP), which focuses on identifying the shortest possible route between a range of points where the pairwise distances are already known. In order to make full use of the fragments' shape information, a registration algorithm for shapes is needed; there are some excellent papers focusing on this domain [19, 20]. By doing this they were able to bring to bear an already established research technique, which led to a series of achievements [2, 4, 8]. However, as manually torn documents are not the primary focus of this paper, we will not examine this technique further here.

With regard to the Reconstruction of Strip Shredded Text Documents (RSSTD), the content information at the boundary of the fragments is sufficiently good that this is not generally considered to be a difficult problem to solve. Prandtstetter and Raidl [24] solved the RSSTD problem by treating it as a TSP and using a variable neighborhood search approach with a semi-automatic system in the optimization process.

The RSSTD problem is usually considered to be a special case of the RCCSTD problem. Most of the papers about the RCCSTD problem have been published in the past decade. Prandtstetter proved that the RCCSTD problem is NP-complete in his thesis [22]. In a later paper, he solved it using ant colony optimization and a variable neighborhood search [23]. Schauer et al. [26] used a genetic algorithm to solve the problem, drawing upon pattern recognition technology. Gong et al. [5] proposed a memetic algorithm based on evolutionary algorithms, and defined four kinds of operators and a comprehensive cost function, thereby obtaining some satisfactory results. Zhao et al. [33] proposed a new cost function based on Information Quantity to reduce the serious propagation of errors caused by the matching of shreds with low Information Quantity. This approach was particularly well-suited to the reconstruction of Chinese text documents. At the outset, most researchers used an approach based upon Swarm Intelligence Algorithms to obtain optimal solutions for the complex spatial configurations that are typical of the kinds of problems discussed here. This approach is usually viable and can produce good results, as the above papers have shown. However, as the number of fragments increases or the boundary information of the fragments decreases, it is much harder to find an optimal solution for complex spatial layouts, and the results can be sensitive to how the cost function is defined. Indeed, it appears to be impossible to define a perfect cost function for the RCCSTD problem.

A solution to this is to cluster fragments into several classes, thereby simplifying the complicated spatial searches. Ukovich et al. [28] used a clustering approach as a preprocessing step before applying the actual reconstruction algorithms. Azzam et al. [27], by contrast, proposed a clustering approach for RCCSTD that defines a cost function as the clustering standard. This gave them high splicing accuracy and an efficient running speed. However, Azzam et al.'s approach proves to be sensitive to the definition of the cost function too. Wang et al. [29] suggested a two-stage approach. Here a clustering algorithm based on text lines was used to cluster fragments into several classes (rows). A memetic algorithm was then used to solve the RSSTD problem. Wang et al.'s paper provides a promising approach to the RCCSTD problem, but its realization in the paper was relatively crude. Xu et al. [31] used a clustering vector to classify fragments into several classes. They then used a genetic algorithm to solve the problem. This appears to be an even more promising approach because it is not sensitive to the cost function and can give high accuracy. However, the method appears to be sensitive to the number of fragments in a row or the number of rows in a shredded text document instead. In our view, the most effective approach to the RCCSTD problem comprises two parts: a feature extraction scheme for clustering fragments, and a heuristic algorithm for the resulting RSSTD problem. For the former step, there are numerous papers that focus on feature extraction and learning methods [10, 12, 13, 14, 17, 25]. There are also abundant publications relating to the latter step. Papers regarding machine learning approaches that are relevant to the RSSTD problem include [3, 11, 15, 16, 18].

In this paper we concentrate on the splicing algorithm for rows and on how to keep the splicing accuracy of the algorithm stable as the number of fragments in a row increases. For cross-cut shredded documents, our approach is organized as follows: (1) We extract clustering vectors from the fragments according to how words are positioned in the fragment [31]. The clustering vector of the first fragment in each row is defined as the cluster center. (2) We execute a clustering algorithm to classify fragments into several classes based upon the clustering vectors and associated cluster centers. In other words, we convert the RCCSTD problem into an RSSTD problem. This step reduces the need for a high-quality cost function, offering similar or higher accuracy even if the cost function is relatively poor. (3) We define the cost function between the fragments in the same row, transforming the RSSTD problem into a TSP. (4) We use an ant colony algorithm to solve the TSP that has been derived from the RSSTD problem. (5) We use both a combination strategy and a divide-and-conquer strategy to correct the errors introduced by the definition of the cost function. This not only improves the splicing accuracy within a row but also makes the algorithm more stable as the number of fragments in a row increases. (6) As a final step, we merge the fragments between the rows by using the clustering vectors extracted from the rows. The accuracy of the final solution is then assessed manually. A flowchart of the whole process is shown in Fig. 2 (a minimal code sketch of the pipeline follows the figure):

Fig. 2
figure 2

Flowchart of the whole process
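As an overview, the following minimal Python sketch (for illustration only; the paper's implementation was in MATLAB) shows how the six steps compose. Each step is passed in as a callable so the skeleton stays self-contained, and all names here are our placeholders rather than the authors' code.

```python
def reconstruct_document(fragments, extract_cv, cluster_into_rows,
                         build_cost_matrix, solve_tsp_aco,
                         apply_strategies, merge_rows):
    """Compose the six steps of the pipeline in Fig. 2 (names are ours)."""
    # (1) Extract a clustering vector CV = [a1, a2, a3, a4]^T per fragment.
    vectors = [extract_cv(f) for f in fragments]

    # (2) Cluster fragments into rows, turning RCCSTD into RSSTD.
    rows = cluster_into_rows(fragments, vectors)

    spliced_rows = []
    for row in rows:
        # (5) Combination / divide-and-conquer strategies shrink the TSP.
        row = apply_strategies(row)
        # (3)-(4) Cost matrix between fragments, then the ant colony TSP.
        order, _ = solve_tsp_aco(build_cost_matrix(row))
        spliced_rows.append([row[i] for i in order])

    # (6) Merge the spliced rows using row-level clustering vectors.
    return merge_rows(spliced_rows)
```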

The main contribution of this paper is to actualize the feature extraction scheme (vector clustering) first presented in Xu et al. [31] and to make this scheme better suited to the processing of abnormal fragments. Additionally, the paper presents an approach that uses both a combination strategy and a divide-and-conquer strategy to improve splicing accuracy whilst retaining stability as the number of fragments in a row increases.

The structure of the paper is as follows: Section 2 introduces our proposed method. In Section 3 we present the results of our experimental testing of the method. Section 4 discusses the results, and Section 5 provides our overall conclusions.

2 Method

2.1 Row clustering

2.1.1 The clustering vector

Because the grayscale image matrix of a fragment is very large, we use a clustering vector to describe each fragment. Using feature extraction [31], the image matrix of each fragment can be represented as a 4 × 1 clustering vector. The vector is defined as \( CV = [a_1, a_2, a_3, a_4]^T \), where \( a_1 \) represents the lower position of an unidentified line of words at the top of the fragment and \( a_4 \) represents the upper position of an unidentified line of words at the very bottom of the fragment. \( a_2 \) represents the upper position of the last identified line of words at the bottom of the fragment and \( a_3 \) represents the lower position of the last identified line of words at the bottom of the fragment (see right of Fig. 3).

Fig. 3
figure 3

The procedures of the feature extraction

The procedure for the feature extraction is as follows (see Fig. 3):

  1. P1.

If the border-top image is white and there is no unidentified line of words at the top of the fragment, let \( a_1 = 0 \); if not, continue to P2.

  2. P2.

If the first line of words in the fragment is identified, similarly let \( a_1 = 0 \) (because the upper position of this line meets the upper boundary of the fragment); otherwise, if the first line of words in the fragment is unidentified, let \( a_1 = l_1 \), where \( l_1 \) is the lower position of the first unidentified line.

  3. P3.

If the border-bottom image is white and there is no unidentified line of words at the bottom of the fragment, let \( a_4 = 0 \); if not, continue to P4.

  4. P4.

If the last line of words in the fragment is identified, also let \( a_4 = 0 \) (because the lower position of this line meets the lower boundary of the fragment); otherwise, if the last line of the fragment is unidentified, let \( a_4 = l_4 \), where \( l_4 \) is the upper position of the last unidentified row.

  5. P5.

If there is any identified line of words in the fragment, let \( a_2 = l_2 \), \( a_3 = l_3 \), where \( l_2 \) is the upper position of the last identified line at the bottom of the fragment and \( l_3 \) is the lower position of the last identified line at the bottom of the fragment. Meanwhile, let \( l = l_2 - l_3 \), where \( l \) is the word height, and let \( l' = l_4 - l_3 \), where \( l' \) is the height of the inter-row space; then continue to P6. If not, let \( l_2 = 0 \), \( l_3 = 0 \), take \( l \) and \( l' \) as the mean values from the other fragments, and continue to P7.

  6. P6.

If \( a_3 < L - l - l' \), where \( L \) is the fragment height, modify \( a_2 = a_2 + l + l' \) and \( a_3 = a_3 + l + l' \) (because at the end of a paragraph there may be no words at the end of a line, as is visible in fragments 15 to 19 in Fig. 4 below, so the clustering vector needs to be modified). Similarly, modify \( a_1 \) and \( a_4 \). End of procedure.

  7. P7.

If \( a_1 = 0 \) and \( a_4 = 0 \), there is no text content in this fragment, so the fragment can be removed from the RCCSTD problem; if not, we need to modify \( a_2 \) and \( a_3 \) according to \( a_1 \), \( a_4 \), \( l \), and \( l' \).

  8. P8.

If a fragment has been rotated 180 degrees, we execute procedures P1 through P7 to get the false clustering vector \( {CV}^{\prime }={\left[{a}_1^{\prime },{a}_2^{\prime },{a}_3^{\prime },{a}_4^{\prime}\right]}^T \). The relationship between the real clustering vector \( CV = [a_1, a_2, a_3, a_4]^T \) and the false clustering vector \( CV' \) is as follows: \( {a}_1=L-{a}_4^{\prime };\kern0.5em {a}_4=L-{a}_1^{\prime };\kern0.5em {a}_2=L-{a}_3^{\prime }+l+{l}^{\prime };\kern0.5em {a}_3=L-{a}_2^{\prime }+l+{l}^{\prime } \). We can thus recover the real clustering vector from the false one.

    Fig. 4
    figure 4

    A set of fragments that include ones where the clustering vector needs to be modified

After all of these procedures, it should be possible to obtain \( CV = [a_1, a_2, a_3, a_4]^T \).
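For concreteness, the following Python sketch (an illustration, not the authors' MATLAB code) implements the core of P1–P5 and P7 for a grayscale fragment. The P6 end-of-paragraph correction and the P8 rotation handling are omitted, and treating a border-touching text band as "unidentified" is our assumption.

```python
import numpy as np

def extract_clustering_vector(img, ink_threshold=128):
    """Minimal sketch of the CV = [a1, a2, a3, a4]^T extraction (P1-P7).

    `img` is a grayscale fragment as a 2-D numpy array. A text-line band
    touching the top/bottom border is treated as "unidentified" (cut);
    a fully contained band as "identified".
    """
    ink = (img < ink_threshold).any(axis=1)   # rows that contain any ink
    L = img.shape[0]
    # Find bands of consecutive ink rows: (top, bottom) per text line.
    bands, y = [], 0
    while y < L:
        if ink[y]:
            top = y
            while y < L and ink[y]:
                y += 1
            bands.append((top, y - 1))
        else:
            y += 1
    if not bands:
        return None                            # blank fragment (P7: remove)

    a1 = bands[0][1] if bands[0][0] == 0 else 0        # P1/P2: cut top line
    a4 = bands[-1][0] if bands[-1][1] == L - 1 else 0  # P3/P4: cut bottom line
    identified = [b for b in bands if b[0] > 0 and b[1] < L - 1]
    if identified:                             # P5: last fully visible line
        a2, a3 = identified[-1]
    else:
        a2 = a3 = 0
    return np.array([a1, a2, a3, a4])
```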

2.1.2 The cluster center

The clustering vector of the first fragment in each row can be defined as the cluster center. The first fragment in each row has a number of notable features; for example, the left side of the fragment's image is white. This being the case, it is easy to discover the first fragment in each row and establish its clustering vector. This clustering vector is then designated the cluster center. The cluster centers of the rows can be denoted \( CV_1', CV_2', \dots, CV_m' \) (we assume that the text document is shredded into m × n fragments, so there are m cluster centers).

After establishing the cluster centers, the fragments need to be clustered into m classes. The criterion for this is \( \left|{CV}_i-{CV}_j^{\prime}\right|<{T}_{th} \), where \( T_{th} \) is a threshold. In other words, as long as the distance between the clustering vector \( CV_i \) and the cluster center \( CV_j' \) remains less than \( T_{th} \), fragment i and the cluster center \( CV_j' \) are in the same row. If necessary, the few remaining fragments can be classified by hand. A minimal sketch of this assignment step follows.
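A minimal sketch of the assignment, assuming the clustering vectors and centers are stored as NumPy arrays (an illustration, not the authors' code):

```python
import numpy as np

def cluster_by_center(cvs, centers, t_th):
    """Assign each fragment's CV_i to the nearest row center CV'_j with
    |CV_i - CV'_j| < T_th; fragments matching no center are returned
    separately for manual classification (Section 2.1.2)."""
    rows = {j: [] for j in range(len(centers))}
    leftover = []
    for i, cv in enumerate(cvs):
        dists = [np.linalg.norm(cv - c) for c in centers]
        j = int(np.argmin(dists))
        (rows[j] if dists[j] < t_th else leftover).append(i)
    return rows, leftover
```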

2.2 Splicing in rows

Solving the RSSTD problem amounts to transforming a random arrangement of fragments into the correct arrangement. The reconstruction problem within rows can be modeled as a Traveling Salesman Problem (TSP). Fragments can be converted into the vertices of a graph, and the adjacency correlations between fragments become the edges between vertices. The distance between two vertices is large when the adjacency correlation of the two fragments is low; by the same token, the distance is small if the adjacency correlation is high. Solving the RSSTD problem is therefore equivalent to finding a minimum-weight Hamiltonian cycle in a complete weighted graph.

2.2.1 Setting up the TSP model

In order to obtain the adjacency matrix for the TSP it is necessary to calculate the distance (i.e. the cost function) between any two vertices. The distance between two fragments, \( d_l(i, j) \), is defined in Eqs. (1)–(3).

$$ {d}_l\left(i,j\right)=\sum \limits_{y=1}^L{e}^{\prime}\left(i,j,y\right) $$
(1)
$$ {e}^{\prime}\left(i,j,y\right)=\left\{\begin{array}{c}1\kern0.5em e\left(i,j,y\right)>\tau \\ {}0\kern1em otherwise\end{array}\right. $$
(2)
$$ \begin{aligned} e\left(i,j,y\right) = {} & 0.7\left|{V}_i(y)-{V}_j(y)\right| + 0.1\left|{V}_i\left(y+1\right)-{V}_j\left(y+1\right)\right| \\ & + 0.1\left|{V}_i\left(y-1\right)-{V}_j\left(y-1\right)\right| + 0.05\left|{V}_i\left(y+2\right)-{V}_j\left(y+2\right)\right| \\ & + 0.05\left|{V}_i\left(y-2\right)-{V}_j\left(y-2\right)\right| \end{aligned} $$
(3)

where e(i, j, y) represents the distance at point y between fragment i's right border and fragment j's left border, with \( V_i(y) \) denoting the pixel value at vertical position y of the corresponding border column (fragment i's right border column and fragment j's left border column respectively). \( e'(i, j, y) \) is the result of binarizing e(i, j, y), and τ is a threshold that can be determined empirically. Note that y in the equation needs to meet the following condition:

$$ y \in \left[3, L-2\right] \cap {\mathbb{N}}^{\ast} $$
(4)

If it does not, another formula has to be used to calculate e(i, j, y), as follows:

$$ e\left(i,j,y\right)=\left|{V}_i(y)-{V}_j(y)\right| $$
(5)

After these calculations have been completed, the required adjacency matrix can be obtained and the RCCSTD problem for a row can be transformed into a more straightforward TSP. However, one further point needs to be noted: the definition of the distance is one of the main sources of error in splicing, because it is never possible to perfectly quantify the adjacency correlation between fragments.
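As a concrete illustration, the following sketch computes \( d_l(i, j) \) for one pair of fragments per Eqs. (1)–(5), under the reading of \( V_i(y) \) given above and assuming 8-bit grayscale border columns:

```python
def row_distance(Vi, Vj, tau):
    """d_l(i, j) per Eqs. (1)-(5): Vi is fragment i's right border column,
    Vj is fragment j's left border column (our reading of V_i(y))."""
    L = len(Vi)
    d = 0
    for y in range(L):
        if 2 <= y <= L - 3:            # Eq. (4), 0-indexed interior points
            weights = ((0.7, 0), (0.1, 1), (0.1, -1), (0.05, 2), (0.05, -2))
            e = sum(c * abs(int(Vi[y + o]) - int(Vj[y + o]))
                    for c, o in weights)
        else:                          # Eq. (5): points near the two ends
            e = abs(int(Vi[y]) - int(Vj[y]))
        d += 1 if e > tau else 0       # Eqs. (1)-(2): binarize and sum
    return d
```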

2.2.2 The method for the TSP

The Traveling Salesman Problem is a classic problem in the field of combinatorial optimization and has been proven to be NP-complete. As a mature domain of research, there are now many possible solutions for a TSP. A number of papers suggest that the ant colony algorithm is a good solution to the TSP [12]. This algorithm is both highly repeatable and accurate, so it is the algorithm we have used to solve the TSP set out above. A compact sketch of such an algorithm is shown below.
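The sketch below is a generic ant-system implementation in Python for the row TSP, not the authors' exact code; the parameter defaults follow the values reported in Section 3.2.

```python
import numpy as np

def ant_colony_tsp(dist, n_ants=19, n_iter=100, alpha=1.0, beta=5.0,
                   rho=0.5, q=100.0, rng=None):
    """Compact ant system: `dist` is the adjacency (cost) matrix of one
    clustered row; returns the best tour found and its length."""
    rng = rng or np.random.default_rng(0)
    n = len(dist)
    dist = np.asarray(dist, dtype=float)
    eta = 1.0 / (dist + 1e-9)          # heuristic visibility 1/d
    tau = np.ones((n, n))              # pheromone levels
    best_tour, best_len = None, float("inf")
    for _ in range(n_iter):
        tours = []
        for _ in range(n_ants):
            tour = [int(rng.integers(n))]
            unvisited = set(range(n)) - {tour[0]}
            while unvisited:
                i = tour[-1]
                cand = list(unvisited)
                p = np.array([tau[i, j] ** alpha * eta[i, j] ** beta
                              for j in cand])
                nxt = cand[int(rng.choice(len(cand), p=p / p.sum()))]
                tour.append(nxt)
                unvisited.discard(nxt)
            length = sum(dist[tour[k], tour[(k + 1) % n]] for k in range(n))
            tours.append((tour, length))
            if length < best_len:
                best_tour, best_len = tour, length
        tau *= 1.0 - rho               # pheromone evaporation
        for tour, length in tours:     # pheromone deposit
            for k in range(n):
                a, b = tour[k], tour[(k + 1) % n]
                tau[a, b] += q / max(length, 1e-9)
    return best_tour, best_len
```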

2.3 Increasing the splicing accuracy in rows

As the number of fragments in a row increases, or the border text information of a fragment decreases, the splicing accuracy declines notably [9]. In order to improve the splicing accuracy, we propose the following two solutions. First, a better definition of the distance can be found; a number of papers suggest ways in which this might be accomplished [5, 9, 26, 33], and in each of these pattern recognition is considered to be the best approach. Second, the TSP solution itself can be made more effective. As mentioned in the introduction to this paper, limited boundary information and a tremendous number of fragments can make it difficult to find an optimal solution for complex spatial configurations, and the results can be sensitive to how the cost function is defined. In our view, increasing splicing accuracy by redefining the cost function is not the most efficient way to proceed. However, if it is possible to improve the effectiveness of the TSP solution by basing it upon the characteristics of the RCCSTD problem, this would appear to be a more promising way forward. We therefore focus upon the latter approach to increasing splicing accuracy in a row. There are two strategies that might be used to achieve this aim: a combination strategy and a divide-and-conquer strategy.

2.3.1 The combination strategy

Some papers have pointed out (e.g. [6]) that when the number of cities in the Traveling Salesman Problem declines, the fault tolerance for distances in the TSP model increases. A related observation is that, as the number of fragments declines, the effect on splicing accuracy of any deviation between the defined distance and the true adjacency correlation is weakened. Thus, for the same distance definition, reducing the number of fragments will improve the splicing accuracy.

So how might the number of fragments be reduced? A specific set of operations that can accomplish this is as follows: when the distance between fragments i and j is less than a specified threshold, the fragments i and j can be treated as adjacent. The two fragments can then be merged into a new fragment (this process is illustrated in Fig. 5, and a code sketch follows the figure). The new fragment takes fragment j's left border as its own left border and takes fragment i's right border as its own right border. In this way the number of fragments is reduced, thereby improving the splicing accuracy.

Fig. 5
figure 5

A demonstration of fragment merging
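A minimal sketch of this merging step, representing fragments as image arrays and assuming `dist_fn` compares the right border column of its first argument with the left border column of its second, following the direction of \( d_l(i, j) \) (the orientation shown in Fig. 5 may differ):

```python
import numpy as np

def combine_fragments(frags, dist_fn, thresh):
    """Combination strategy (Section 2.3.1): greedily merge any pair
    (i, j) with dist_fn(i, j) < thresh into one wider fragment, i placed
    left of j; repeat until no pair qualifies, shrinking the TSP."""
    frags = list(frags)
    merged = True
    while merged:
        merged = False
        for i in range(len(frags)):
            for j in range(len(frags)):
                if i != j and dist_fn(frags[i], frags[j]) < thresh:
                    frags[i] = np.hstack([frags[i], frags[j]])
                    del frags[j]
                    merged = True
                    break
            if merged:
                break
    return frags
```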

2.3.2 The divide-and-conquer strategy

Blank spaces usually appear at the ends of paragraphs. Thus, the quantity of information in fragments from the same row can differ. For example, in Fig. 4 fragments 1 to 14 have three lines of words, but fragments 15 to 19 only have two. If fragments within the same model have different quantities of information, this can lead to mistakes and decrease the splicing accuracy.

The question then arises as to how to reduce the influence of this phenomenon. One of the best ways of tackling the problem is to use a divide-and-conquer algorithm. With this, the fragments can be divided into two parts according to their information content and each part spliced separately. This not only improves the splicing accuracy, but also reduces the time complexity of the algorithm. A minimal sketch of the split is shown below.
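A minimal sketch of this split, assuming a hypothetical helper `count_lines` that returns the number of text-line bands in a fragment:

```python
def split_by_line_count(frags, count_lines):
    """Divide-and-conquer strategy (Section 2.3.2): partition a row's
    fragments into full-height and short (end-of-paragraph) groups; each
    group can then be solved as a smaller TSP and the results joined."""
    counts = [count_lines(f) for f in frags]
    most = max(counts)
    full = [f for f, c in zip(frags, counts) if c == most]
    partial = [f for f, c in zip(frags, counts) if c < most]
    return full, partial
```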

2.4 Splicing between rows

When the splicing within rows has been completed, the splicing between rows becomes an easy task. It can be achieved by matching the rows' clustering vectors (\( CVR_i \)). For the sake of clarity, the whole matching algorithm is shown below:

figure c
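Since the algorithm is given only as a figure in the original, the following Python sketch is a hypothetical reading of it: rows are chained so that the visible partial line at the bottom of one row and the visible partial line at the top of the next sum roughly to the word height l.

```python
def order_rows(row_cvs, frag_height, word_height, tol):
    """Hypothetical between-rows matching: row_cvs[r] = [a1, a2, a3, a4]
    for spliced row r. A row whose bottom line is cut (a4 > 0) sits above
    a row whose top line is cut (a1 > 0)."""
    def gap(above, below):
        visible_bottom = frag_height - row_cvs[above][3]  # below a4
        visible_top = row_cvs[below][0]                   # down to a1
        return abs(visible_bottom + visible_top - word_height)

    # Start from a row whose top line is uncut (a1 == 0).
    order = [next(r for r in range(len(row_cvs)) if row_cvs[r][0] == 0)]
    remaining = set(range(len(row_cvs))) - {order[0]}
    while remaining:
        nxt = min(remaining, key=lambda r: gap(order[-1], r))
        if gap(order[-1], nxt) > tol:
            break                       # no compatible row; finish manually
        order.append(nxt)
        remaining.discard(nxt)
    return order
```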

3 The experiments

The main point of this paper is to examine the viability and splicing accuracy of an algorithm that is capable of solving the RCCSTD problem. This being the case, we need to eliminate any interference arising from other factors so as to simplify the problem. There are many factors beyond the algorithm itself that can influence splicing accuracy. These include the loss of paper that is turned to dust by the shredding knives, the skew of the cuts relative to the text lines, the resolution of the scanner, noise in the image arising from the scanning process, and so on. Bearing this in mind, we created a test data set by using a digital simulation of a physical cross-cut shredder. The resolution of the text document's image was 1368 × 1980.

We considered a set of ten text documents. Five of them were in Chinese, and five in English. All of the documents were in Times New Roman, 12-point font, with no line spacing. The documents were shredded into 5 × 5, 7 × 7, 8 × 8, 9 × 9, 10 × 10, 11 × 11, 11 × 13, 11 × 15, 11 × 17 and 11 × 19 shreds by a computer to serve as our test data. So as to enable other research teams to replicate the experiment, we use open data here to demonstrate the process. This is followed by the statistical results from the actual experiments, which made use of personal text documents.

We use Appendix 3 of China's Undergraduate Mathematical Contest in Modelling (CUMCM)-2013, problem B [1], as our demonstration data. As space is limited, we will just use the most complicated parts of the document. The document was cut into 11 × 19 fragments. The size of each fragment image was 180 × 72 pixels, the size of a word in the image was about 40 × 40 pixels, and the inter-row space height was about 28 pixels. We implemented our method in MATLAB R2012b and performed all tests on a Core i5-2450M CPU with 4 GB of RAM.

3.1 Demonstration of the clustering analysis and exception handling

The first step in the clustering analysis is to obtain the clustering vector, as described in the method presented above. However, as we begin the process, we find some errors in the feature extraction, as shown in Fig. 6.

Fig. 6
figure 6

Demonstration of clustering analysis and exception handling

Due to the characteristics of Chinese, there are some interruptions in the vertical strokes of characters. This phenomenon exists in English too, but it is more pronounced in Chinese. As shown in Fig. 6, the interruption causes the extraction to yield a clustering vector of [7, 119, 159, 0]^T for this fragment, whereas the real clustering vector is [23, 119, 159, 0]^T. So we need a way to correct this kind of error. These interruptions in vertical strokes can be thought of as a sort of one-dimensional salt-and-pepper noise. This being the case, we can use an improved one-dimensional median filter to remove the noise. Based on the work presented in [34] and its associated experiment, the window length of the median filter is set to 7, with the result after filtering shown at the right of Fig. 6. A minimal sketch of the filter follows.
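A minimal sketch of such a filter (the basic variant, not the improved one used in the paper) applied to a 1-D ink profile, with the window length of 7 from [34]:

```python
import numpy as np

def median_filter_1d(profile, window=7):
    """Remove stroke-interruption 'salt-and-pepper' noise from a 1-D 0/1
    ink indicator (one value per image row) with a sliding median."""
    profile = np.asarray(profile)
    half = window // 2
    padded = np.pad(profile, half, mode="edge")   # replicate the ends
    return np.array([np.median(padded[i:i + window])
                     for i in range(len(profile))])
```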

After correcting these errors, we obtain a clustering vector for each fragment and a cluster based upon the cluster center for each row. We set the clustering threshold \( T_{th} \) to 1/20 of the vertical resolution of the fragment images used in the experiment. With this threshold the fragments can be clustered without mistakes. The results are shown in Table 1.

Table 1 The clustering results

3.2 Splicing result of demonstration data

Based on the clustering results shown in Table 1, we then spliced within rows. The distance definition was used to obtain the adjacency matrix, and the splicing problem was transformed into a TSP. After this, the ant colony algorithm was used to solve the TSP. Following the literature [32], the parameters for the ant colony algorithm were set as α = 1, β = 5, ρ = 0.5, and the number of ants was 19. The algorithm was then used to splice the row fragments. In this way we were able to obtain a reconstruction of CUMCM-2013, problem B, Appendix 3 that was correct and that had been accomplished without manual intervention during the splicing (see Fig. 7). The algorithm took 16.43 s to reconstruct the document and an average of 1.43 s to splice each row.
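For reference, this parameter setting maps directly onto the ant-system sketch from Section 2.2.2 (the `ant_colony_tsp` name and the random stand-in matrix below are ours, not the authors'):

```python
import numpy as np

# Stand-in for the 19 x 19 adjacency matrix of one clustered row.
dist = np.random.default_rng(1).integers(1, 180, size=(19, 19)).tolist()

# Parameter values reported in this section: alpha=1, beta=5, rho=0.5, 19 ants.
best_tour, best_len = ant_colony_tsp(dist, n_ants=19,
                                     alpha=1.0, beta=5.0, rho=0.5)
```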

Fig. 7
figure 7

The reconstructed result for CUMCM-2013, problem B, Appendix 3

3.3 Splicing result statistics of test data

Table 2 shows the results obtained by applying our method to the actual test documents. The number of fragments in the wrong position for each document under every shredding pattern is listed in the table. Note that the demonstration text document, labeled C1, is also included in the table.

Table 2 Splicing result statistics of test data

It can be observed that two of the Chinese text documents could be restored without any errors and without manual intervention across all patterns. At first sight it would appear that reconstructing Chinese text documents gives a higher splicing accuracy than reconstructing English text documents.

4 Discussion

4.1 The necessity of two strategies

One of the main concerns of this paper is to explore whether a combination strategy and a divide-and-conquer strategy can together improve the splicing accuracy in a row for the same distance definition. To examine this issue, let us begin by looking at the differences between tests in which these two strategies were and were not used. Here all of the test documents were shredded into 11 × 19 shreds, with document C1 continuing to serve as the demonstration data.

4.1.1 Splicing in rows using the basic model

First of all, we used the ant colony algorithm with the parameters presented in Section 3.2 as a basic model (without applying either of the potential strategies). The results we obtained for this basic model were as follows: 4 rows were spliced correctly (Rows 1, 3, 5 and 7) and 7 rows contained some mistakes (Rows 2, 4, 6, 8, 9, 10 and 11). There were 28 mistakes overall within the 7 rows that contained errors. For the purposes of discussion, we will focus on Row 4 (see Fig. 8). The algorithm using the basic model took 3.23 s on average to splice a row.

Fig. 8
figure 8

Demonstration of the splicing results in the basic model using Row 4

4.1.2 Splicing in rows using the basic model and the divide-and-conquer strategy

As another test we used the basic model together with the divide-and-conquer strategy to splice the fragments. Once again Row 4 is used to illustrate the results (see Fig. 9). In this case the algorithm spent 1.98 s to splice a row on average.

Fig. 9
figure 9

Showing the difference between where the divide-and-conquer strategy is used or not used

As we can see, when the basic model is used together with the divide-and-conquer strategy, the number of mistakes declines significantly: the number of mistakes in Row 4 decreased from 6 to 2. In general, the total number of mistakes dropped from 28 to 23 and there were 2 more rows that were now spliced without errors (Row 8 and Row 10). However, this strategy is only useful when there are fragments in a row with different numbers of text lines. In other words, this strategy cannot be used for every row.

4.1.3 Splicing in rows using the basic model and the combination strategy

Next we used the basic model together with the combination strategy to splice the fragments, again using Row 4 for demonstration (see Fig. 10). The algorithm with the basic model and the combination strategy took 1.80 s on average to splice a row.

Fig. 10
figure 10

Showing the difference between where the combination strategy is used or not used

As we can see, when the basic model is combined with the combination strategy, the number of mistakes declines. For example, the number of mistakes in Row 4 decreased from 6 to 3. In general, the total number of mistakes dropped from 28 to 7, and only 3 rows still contained splicing mistakes (Row 4, Row 8 and Row 10).

4.1.4 Splicing in rows using the basic model, the combination strategy and the divide-and-conquer strategy

Finally, let us look at the case where the basic model, the combination strategy and the divide-and-conquer strategy were all used together to splice the fragments. Yet again we refer to Row 4 for demonstration (see Fig. 11). This algorithm took 1.43 s on average to splice a row.

Fig. 11
figure 11

Showing the difference between using just the combination strategy and using both strategies together

As can be seen in Fig. 11, the number of mistakes in Row 4 dropped to 0, and the total number of mistakes across all rows also dropped to 0.

4.1.5 Summary of the overall differences between using and not using the two strategies

Table 3 contrasts the differences between using and not using the two strategies. In order to exclude the possible influence of any accidental factors, we ran the same experiment across another 9 test documents. This produced some differences in the results between Chinese and English documents, so we list those results separately (see Table 4).

Table 3 The differences between using or not using the two strategies

Table 4 suggests that the efficiency of the two strategies is not the same for Chinese and English documents. Overall, they appear to be more effective for Chinese documents. However, based on the two tables together we can still conclude that the combination strategy and the divide-and-conquer strategy not only improve splicing accuracy in a row but also reduce the time complexity of the algorithm. Thus, there is clear and significant value in applying the combination strategy and the divide-and-conquer strategy to the RCCSTD problem.

Table 4 Experimental results using another 9 test documents

4.2 Convergence of the algorithm

Convergence is an important property for algorithms, especially heuristic algorithms. An ant colony algorithm was used to solve the RSSTD problem, so it is necessary to test its convergence. This was done using numerical experiments. The parameters of the ant colony algorithm were set as α = 1, β = 5, ρ = 0.5 and the number of ants was 19. The evolution curve for the optimal solution of the ant colony algorithm for Rows 4 and 5 in test document C1 is shown in Fig. 12.

Fig. 12
figure 12

Evolution curve for the optimal solution for our ant colony algorithm

From the experiment detailed in Section 4.1 we know that the cost function values of the optimal solutions for Rows 4 and 5 are 360.7513 and 354.4565 respectively. Looking at Fig. 12, it can be seen that our algorithm converges to the optimal solution by the 25th and 31st iterations respectively. Based on this, we would argue that our algorithm is convergent.

4.3 Comparison between distance definitions

The definition of the distance between fragments can affect the splicing accuracy in a row. It is therefore necessary to test the differences between candidate distances. We compared our distance definition with the Manhattan distance, the Euclidean distance and the cosine distance. The number of fragments in the wrong position and the number of correctly spliced documents are listed in Table 5 (all of the test documents were shredded into 11 × 19 shreds).

Table 5 Comparative results for distances

As we can see from Table 5, whilst our method may not offer the best splicing accuracy for every document, on our test data our definition achieves the best results for correct document splicing (although the effect of chance cannot be ruled out). Furthermore, the proposed scheme is easy to calculate and understand, reinforcing our choice of this distance for the cost function in row splicing.

4.4 Comparing the splicing accuracy with Xu et al.’s paper [31]

This paper is derived in part from the work of Xu et al. It is therefore also necessary to compare our splicing accuracy with Xu et al.'s original results. In order to control the variables, we use the same definition of accuracy as Xu et al. put forward [31], which is as follows:

$$ Accuracy = 1 - \frac{\text{the number of fragments in the wrong position}}{\text{the total number of fragments}} $$
(8)
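For instance, under this definition a document shredded into 11 × 19 = 209 fragments with 10 fragments in the wrong position would score 1 − 10/209 ≈ 0.952 (these figures are purely illustrative, not taken from our experiments).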

We calculated the splicing accuracy for both Chinese and English text documents. The results are shown in Fig. 13.

Fig. 13
figure 13

Splicing accuracy when reconstructing English and Chinese documents shredded into different pieces

We found that the splicing accuracy was more than 0.95 for the Chinese documents and 0.85 for the English ones. This means that reconstructing Chinese text documents gives a higher splicing accuracy than reconstructing English text documents. There are many reasons for this phenomenon. One is that Chinese text document fragments have more boundary content than English text document fragments when both use the same font. Chinese also has a strong "square" quality to its characters, especially in modern simplified Chinese. Accuracy rates for English documents fall to the mid-80% range as the number of fragments rises to 150, which may not be a good result by the standards of classic document recognition tasks.

Setting aside the difference between the two languages, note that the accuracy was higher than 0.85 in either case when the number of fragments was lower than 210. Beyond this, note that the slope of line k1 (−0.0016) is larger in absolute value than the slope of line k2 (−0.000318). Line k1 represents the splicing accuracy of English text documents that were shredded into 8 × 8, 9 × 9, 10 × 10, and 11 × 11 fragments. Line k2 represents the splicing accuracy of English text documents that were shredded into 11 × 13, 11 × 15, 11 × 17, and 11 × 19 fragments. This means that the splicing accuracy is more strongly affected by the number of rows than by the number of fragments in a row. In other words, the splicing accuracy of our approach decreases more quickly as the number of rows in a shredded text document increases than as the number of fragments in a row increases.

Comparing our results with Xu et al.'s (see Fig. 14), it can be seen that our proposed algorithm gives a higher splicing accuracy than Xu et al.'s approach, especially when the number of fragments is greater than 100.

Fig. 14
figure 14

The precision of reconstructing a document shredded into different pieces according to the work of Xu et al. [31]

5 Conclusion and future work

The algorithm presented in this paper was able to achieve a set reconstruction task - the automatic reconstruction of document CUMCM-2013, problem B, Appendix 3 - without error. A first point of note is that using a combination strategy and a divide-and-conquer strategy together can improve splicing accuracy whilst reducing time complexity. A second point of note is that the approach presented in this paper gives a higher splicing accuracy than the approach proposed in similar previous work by Xu et al. [31], because it is not sensitive to the number of fragments in a row. The splicing accuracy was over 0.95 for the Chinese test data and over 0.85 for the English test data, so the splicing accuracy is generally high. However, our approach does have some issues: 1) the test data was not physically cross-cut shredded text documents, and this may make a difference; we also overlook many factors that can affect splicing accuracy, such as abnormal printing on the paper, inclined cutting of the paper, paper dust produced by the blades, and so on; 2) the splicing accuracy may not be sensitive to the number of fragments in a row, but it is sensitive to the number of rows; 3) the narrow margins of the documents help us identify the first and last fragments in each row, and this characteristic makes our task easier, but not all shredded documents have narrow margins; 4) if there are two or more identical cluster centers in one shredded document, it is not possible to finish the clustering and the fragments cannot be reconstructed. The latter issue suggests that the algorithm may not be effective for the reconstruction of multiple shredded documents, because of the possibility of identical cluster centers.

In future work we will look at how to extract more features to improve the accuracy of the row clustering and will attempt to address the various issues described above, paying particular attention to the reconstruction of multiple cross-cut shredded text documents using both simulated and real data. We also need to focus more on reconstructing real shredded text documents: there are many issues to be solved in transforming real scanned images into the kind of simulated images used in this paper. To sum up, above and beyond all other matters, maintaining a high splicing accuracy is the primary goal when tackling the RCCSTD problem, and this has been the central focus of our approach.