1 Introduction

Document forgery is a ubiquitous and noteworthy problem in today's age. It is the process of fabricating, transforming or imitating writings, objects, or documents [16]. False making entails fraudulently writing on an official document or making subtle changes to the original document, and it brings the authenticity of the document into question. More often than not it involves falsifying or forging signatures [28], a type of forgery that directly correlates with the theft of an individual's identity. Forgery can also include backdating, overwriting, and adding addendums to a document. In such cases, a different ink is used to modify an already existing document, resulting in ink forgery.

Broadly speaking, there are two ways to pinpoint or identify forgery: destructive methods and non-destructive methods. Destructive methods analyse the ink itself, usually in laboratories [1] or similar settings. However, these methods do not always abide by the ethical code of conduct [11], because they are intrusive by default: they involve taking a sample of the document, such as a punch from the paper, and analysing it separately, or otherwise disturbing the ink across the entire document [10]. Such analysis can damage or alter the document permanently. The most common destructive ink testing method is thin layer chromatography (TLC) [1].

In contrast, non-destructive methods are non-intrusive and do not alter the document fundamentally [35]. The document is left intact and can be re-evaluated in the future, and the analysis is faster and cheaper [32]. The most prevalent non-destructive technique is spectrophotometry [6]. Spectroscopy, the study of the interaction between matter and radiation, is an umbrella term that covers spectrophotometry [5].

Hyperspectral imaging (HSI) is a type of spectral imaging capable of leveraging both spatial and spectral data from materials or objects [18]. With the ability to analyse a wide range of the electromagnetic spectrum [7, 14], HSI was first utilised in remote sensing [29]. Unlike conventional imaging, it can extend the number of scanned bands to hundreds or even thousands [4, 34]. As the analysed bands are narrow and contiguous, HSI effectively records subtle variations in reflected energy.

2 Literature Survey

This section explains some of the previous work done in detecting forgery in documents using hyperspectral technology.

Khan et al. [25] created the first publicly available dataset for document forgery detection using hyperspectral imaging, known as the “UWA Writing Ink Hyperspectral Image Database”. To reduce dimensionality and extract crucial features from the dataset, they used Joint Sparse Principal Component Analysis (JSPCA). Similarly, to identify different inks, the Joint Sparse Band Selection (JSBS) technique was implemented. This technique achieved an accuracy of around 85%.

Devassy et al. [12] created their own document dataset using hyperspectral technology. The data was normalised using the standard reference target technique and then fed into a one-dimensional convolutional neural network (CNN), achieving 91% accuracy.

In [22], six different CNNs were implemented using the UWA Writing Ink Hyperspectral Image Database, achieving 99.2% accuracy for blue inks and 99.6% for black inks. However, this approach is limited as it requires prior knowledge for training the neural networks.

Luo et al. [30] also utilised the UWA database, carrying out ink distinction in a local area using anomaly detection. The study concluded that the Influenced Outlierness (INFLO) anomaly detection method, integrated with a point-to-point (P2P) criterion for feature selection, gave the best results. This technique also tackles the a priori problem via clustering.

The UWA database is also used in [33], where an orthogonal and graph-regularised Non-negative Matrix Factorisation model is applied. It achieves an accuracy of around 85%, but falls short in finding the optimal hyperparameter selection.

In [20], hyperspectral data was extracted from the spectral responses of neighbouring pixels, again using the WIHSI database. The responses were fed to convolutional neural networks (CNNs) to determine the writer’s identity, with a highest accuracy of 71%. However, a major limitation of this work is that the dataset was not large enough to confirm the model’s reliability.

The use of Least Squares Support Vector Machines (LS-SVM) for ink detection using hyperspectral data was proposed by Morales et al. [31]. They built a dataset from scratch, creating hyperspectral curves by removing the background. Smoothing procedures were applied to the curves to extract 21 crucial pixels, and feature pairs were created from the area and slope features. Finally, the SVM algorithm was applied to this data, with a highest accuracy of 80%.

Another work that created its own dataset is that of Wang et al. [36]. They focused on noise removal and dimensionality reduction by applying Principal Component Analysis (PCA), after which the dataset underwent pseudo-colour synthesis. This produced a technique to detect the writer’s identity, although the accuracy varied with changing ink patterns.

3 Dataset Description

For this work, the UWA Writing Inks Hyperspectral Images (WIHSI) database [26] was selected. This database consists of 70 hyperspectral images, each captured across 33 bands in the visible spectrum. It contains ten different types of inks, five blue and five black, along with seven different subjects. On one page, the sentence ‘The quick brown fox jumps over the lazy dog’ was written by each subject using five different inks from varying manufacturers. This was done to ensure the inks appeared visually similar, being the same colour (blue or black), yet remained unique.

4 Methodology

The methodology followed to do the work is depicted in Fig. 1.

Fig. 1. Methodology followed in this work

4.1 Preprocessing

On the dataset, the background was removed and Sauvola thresholding was applied. This method was chosen as it accounts for unequal illumination and can efficiently extract crucial data from hyperspectral information. The five sentences in each document were then decomposed into individual sentences for easier analysis during classification. These sentences were transformed into hypercubes, i.e., 3-dimensional matrices that reflect the 33 spectral bands. Finally, spectral response vectors of dimensions 1 × 33 are created.
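As a rough illustration of this step, the sketch below (in Python, using NumPy and scikit-image) shows how the ink pixels of a hypercube could be separated from the background with Sauvola thresholding and turned into 1 × 33 spectral response vectors. It assumes each hypercube is stored as an (H, W, 33) array; the band used to build the mask and the window size are illustrative assumptions, not values reported here.

```python
import numpy as np
from skimage.filters import threshold_sauvola

def extract_spectral_responses(hypercube, mask_band=16, window_size=25):
    """Separate ink pixels from the paper background and return their
    1 x 33 spectral response vectors.

    hypercube   : ndarray of shape (H, W, 33) -- one visible-band hypercube
    mask_band   : band used to build the foreground mask (illustrative choice)
    window_size : Sauvola window size (illustrative choice)
    """
    grey = hypercube[:, :, mask_band]
    # Sauvola thresholding adapts to unequal illumination across the page.
    thresh = threshold_sauvola(grey, window_size=window_size)
    ink_mask = grey < thresh                 # ink is darker than the paper
    # Every foreground pixel yields one 1 x 33 spectral response vector.
    responses = hypercube[ink_mask]          # shape: (n_ink_pixels, 33)
    return responses, ink_mask
```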

To carry out the testing of ink detection, hypercubes were created with different inks mixed in various ratios, as shown in Table 1.

Table 1. Ratios in which inks were mixed
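The actual ratios are those listed in Table 1; purely as a hypothetical sketch, the snippet below shows one way pixels from two inks could be combined in a chosen ratio. The total pixel count, the sampling scheme, and the binary label encoding are assumptions for illustration only.

```python
import numpy as np

def mix_inks(resp_a, resp_b, ratio=(1, 32), n_total=3300, seed=0):
    """Sample spectral responses from two inks in a given ratio (e.g. 1:32)
    and return the mixed pixel set with per-pixel ink labels."""
    rng = np.random.default_rng(seed)
    n_a = n_total * ratio[0] // (ratio[0] + ratio[1])
    n_b = n_total - n_a
    # Assumes each ink has at least n_a / n_b available pixels.
    idx_a = rng.choice(len(resp_a), size=n_a, replace=False)
    idx_b = rng.choice(len(resp_b), size=n_b, replace=False)
    X = np.vstack([resp_a[idx_a], resp_b[idx_b]])        # (n_total, 33)
    y = np.concatenate([np.zeros(n_a), np.ones(n_b)])    # 0 = ink A, 1 = ink B
    return X, y
```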

4.2 Dimension Reduction

Dimensionality reduction is a process that creates subsets of important features which then act as new attributes. Its significance is that less information is lost in comparison to feature selection, though it can require more space. The following elaborates on the various dimension reduction algorithms applied:

Principal Component Analysis. Principal Component Analysis (PCA) is a technique for deriving significant factors (in the form of components) from a large set of variables available in a dataset [37]. It extracts a low-dimensional set of features by projecting out inessential dimensions of a high-dimensional dataset, with the aim of retaining as much information as possible. In essence, it takes the crucial information from the overall dataset and represents it as principal components, which are a set of new orthogonal variables [3]. These showcase the similarity pattern and create a map of point variables. PCA is a widely used tool, especially in unsupervised applications [13].
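As a minimal illustration, PCA can be applied to the matrix of spectral response vectors with scikit-learn as follows; the number of retained components is an illustrative assumption.

```python
from sklearn.decomposition import PCA

# X is the (n_pixels, 33) matrix of spectral response vectors.
pca = PCA(n_components=10)               # component count is illustrative
X_pca = pca.fit_transform(X)
# Cumulative explained variance shows how much spectral information the
# retained components preserve.
print(pca.explained_variance_ratio_.cumsum())
```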

Factor Analysis. In factor analysis, correlations are used to group the variables, i.e., all variables in a specific group have a high correlation among themselves but a low correlation with variables of other group(s). The scale of the variables does not come into play, unlike the orthogonal rotations applied to the factors [17]. According to [21], there are two steps in this process: first an initial solution is found, and then it is rotated. Factor analysis is not limited to dimension reduction; it is also used to understand the underlying dimensions and to test hypotheses [27].
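A corresponding sketch with scikit-learn's FactorAnalysis is shown below; the number of factors is illustrative, and the varimax rotation (available in scikit-learn 0.24 and later) mirrors the rotation step described above.

```python
from sklearn.decomposition import FactorAnalysis

# Factor analysis models each 33-band spectrum as a linear combination of a
# few latent factors plus per-band noise; the solution is then rotated.
fa = FactorAnalysis(n_components=10, rotation="varimax")  # values illustrative
X_fa = fa.fit_transform(X)
# fa.components_ holds the factor loadings (n_factors x 33 bands).
```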

Independent Component Analysis. Independent Component Analysis (ICA) is based on information theory and is also one of the most widely used dimensionality reduction techniques [19]. The significant difference between PCA and ICA is that PCA searches for uncorrelated components while ICA searches for independent ones. If two variables are uncorrelated, there is no linear relationship between them; if they are independent, they do not depend on other variables at all [9]. For instance, the age of an individual is independent of what that individual eats or how much TV they watch. The algorithm assumes that the given variables are linear combinations of some unknown latent variables. It also assumes that these latent variables are mutually independent, i.e., they are not dependent on other variables, and hence they are known as the independent components of the observed data [8].
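Analogously, a minimal FastICA sketch is given below; again the number of components is an illustrative assumption.

```python
from sklearn.decomposition import FastICA

# FastICA searches for statistically independent (not merely uncorrelated)
# components by maximising the non-Gaussianity of the projections.
ica = FastICA(n_components=10, random_state=0)  # component count illustrative
X_ica = ica.fit_transform(X)
```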

4.3 Machine Learning

In this method, the 10 different inks used are known beforehand, so supervised machine learning classifiers are applied. Decision Trees and Random Forests were chosen to check how crucial the different bands are for detecting inks. K-Nearest Neighbours was chosen to capture document forgery inspection by leveraging neighbours in the feature space. Support Vector Machines (SVM) were also selected; they are more powerful as they use soft margins and complex hyperplanes. A sketch of this classification stage is given below.
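The following hedged sketch uses scikit-learn; the train/test split, the hyperparameters (such as k = 15 and 100 trees), and the use of plain hold-out accuracy via score() are illustrative simplifications, whereas the reported results use the segmentation accuracy of Eq. 1.

```python
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier

# X: spectral responses (optionally dimension-reduced), y: ink labels.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

classifiers = {
    "SVM (RBF)": SVC(kernel="rbf"),
    "SVM (linear)": SVC(kernel="linear"),
    "SVM (poly)": SVC(kernel="poly"),
    "Decision Tree": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "KNN": KNeighborsClassifier(n_neighbors=15),
}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    print(name, clf.score(X_test, y_test))
```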

5 Results

Tables 2 and 3 summarise the quantitative results obtained by applying various machine learning classifiers and dimension reduction algorithms to the HSI data. The accuracy used is the segmentation accuracy given by Eq. 1; it is the intersection-over-union metric, which computes the number of correctly labelled pixels of an ink divided by the number of pixels labelled with that ink in either the ground truth or the prediction [15]. Overall, better results were obtained for blue inks than for black inks when keeping the processes the same.

$$\begin{aligned} Accuracy = \frac{\text {True Positives}}{\text {True Positives} + \text {False Positives} + \text {False Negatives}} \end{aligned}$$
(1)
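For clarity, Eq. 1 corresponds directly to the following per-ink computation; the helper below assumes integer ink labels for the ground-truth and predicted pixel maps.

```python
import numpy as np

def segmentation_accuracy(y_true, y_pred, ink_label):
    """Per-ink segmentation accuracy of Eq. 1 (intersection over union):
    TP / (TP + FP + FN) for the given ink label."""
    true_mask = (y_true == ink_label)
    pred_mask = (y_pred == ink_label)
    tp = np.logical_and(true_mask, pred_mask).sum()
    fp = np.logical_and(~true_mask, pred_mask).sum()
    fn = np.logical_and(true_mask, ~pred_mask).sum()
    return tp / (tp + fp + fn)
```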
Table 2. Black ink accuracy when applied with different processes.
Fig. 2. Blue inks at the 1:32 ratio, showing that dimension reduction helped to increase the accuracy in this study. (Color figure online)

Fig. 3. Black inks at the 1:1 ratio, showing that dimension reduction helped to increase the accuracy in this study.

5.1 Machine Learning Without Dimension Reduction

After investigating the classification accuracy of the classifiers for both inks without dimension reduction, the following insights were gained:

  1. Out of the three SVM kernels implemented, the one with the best performance was the ‘RBF’ kernel for both blue and black inks.

  2. After the ‘RBF’ kernel, the next best kernel in terms of accuracy was the ‘linear’ kernel for both blue and black inks. The kernel with the weakest performance was the ‘poly’ kernel.

  3. Random Forest outperformed Decision Trees for both inks in terms of accuracy. This can be attributed to the combination of many decision trees in a random forest.

  4. In K-Nearest Neighbours, the best performance for black inks was obtained when ‘k’ was set to 15.

  5. For blue inks, the KNN performance varied: for some ratios ‘k’ was best set to 5, and for other ratios to 10.

  6. Across all of the ML classifiers applied, the global trend was that blue inks achieved higher accuracy than black inks.

  7. Another global trend across all accuracies compared was that the 1:32 ratio had an overall lower accuracy than the other ratios. This can be due to the fact that, at this ratio, the first ink has a very small presence, making it difficult to classify.

5.2 Machine Learning with Dimension Reduction

After applying dimension reduction to the data, the classifiers’ accuracy showed the following:

  1. Out of the three dimension reduction algorithms applied, the best performance was achieved by Factor Analysis.

  2. After factor analysis, the next best algorithm was Independent Component Analysis; in fact, their performances were neck and neck. This could be because the two algorithms are very similar: both focus on finding basis vectors, but ICA additionally evaluates the kurtosis of the projections to obtain independent components.

  3. PCA performed the worst of all the methods applied and could not even match the performance of the machine-learning-only pipeline. This could be because PCA does not work with specific latent vectors, nor is it generative like factor analysis.

  4. Factor analysis and ICA helped to increase the performance, especially for the 1:32 ratio and overall for black inks. This is depicted in Figs. 2 and 3 respectively.

5.3 Comparison with Existing Methods

Table 4 showcases this method’s accuracy in contrast with existing approaches that use the same dataset for ink detection. The proposed technique outperforms the previous methods by achieving the highest accuracy on the mixed ink combinations containing unequal proportions.

Table 3. Blue ink accuracy when applied with different processes.
Table 4. Comparative study of proposed method

6 Conclusions and Future Work

From the investigations carried out, one can safely conclude that spectral data is a good discriminator of inks and plays an eminent role in document forgery detection. It can also be concluded that the machine learning classifiers achieved high accuracy in identifying ink mismatch.

In the future, more complex models and pre-analysis techniques, such as deep learning, should be explored for classification. The present approach uses only spectral data, hence the performance may improve by using both spatial and spectral data. Another drawback that can be worked on is the disparity in performance between black and blue inks. To further improve document forgery detection, writer identification can also be studied in the future.