Introduction

From a forensic standpoint, the analysis of documents is of real interest, as crimes such as falsification, questioned signatures, threatening letters or even terrorist attacks may leave documents as evidence (Ortiz-Herrero et al. 2018; Calcerrada and Garcia-Ruiz 2015). The carbon-14 content of a material is a standard means of dating documents and other materials. However, because the half-life of carbon-14 is 5730 years (Ward et al. 2018), only materials roughly 1000 to 5000 years old can be dated reliably by this method. Although helpful in archaeology, carbon-14 dating therefore has many restrictions in forensic use. In this study, we aimed to find a method that serves the same function as carbon-14 dating but can date recent documents to within 5 years.

Research on document dating follows two directions. The first is ink dating, a well-developed technique for classifying the date of documents. The date of a document written with a specific ink can be estimated by several techniques such as HPLC (Andrasko 2001; Liu et al. 2006), GC (Brazeau and Gaudreau 2007), MS (Weyermann et al. 2007), FTIR (Wang et al. 2001) and UV–vis (Senior et al. 2012; Xu et al. 2006). Although the ages of a document and its ink are usually close, deviations can still exist between the two. Thus, dating based on the document itself is very meaningful and can resolve this ambiguity about the time source.

The other direction is the study of the paper itself. Studies on paper degradation have been reported (Calcerrada and Garcia-Ruiz 2015), but the process of document aging is rather complicated: the main ingredient of documents is cellulose, yet they also contain many inorganic fillers added to maintain desired characteristics (Silva et al. 2018). Therefore, document dating has been studied only to a limited extent. Studies on cellulose degradation cover the characterization of cellulose (Missori et al. 2006; Souguir et al. 2017), degradation products (Dupont et al. 2012) and document date. Zięba-Palus et al. (2017) distinguished documents of different ages by infrared and Raman spectroscopy and derived the degradation mechanism of documents from 2D correlation maps. Ortiz-Herrero and Blanco used py-GC/MS to estimate document age through two different approaches based on paper analysis: (1) a direct method using a PLS model of the pyrolytic profiles of synthetic samples correlated with their aging time in a chamber, and (2) an indirect method identifying the characteristic paper components of different time periods. They concluded that 5 h in the chamber under the experimental conditions were equivalent to one natural year under police custody conditions (Ortiz-Herrero et al. 2018). Laser-induced fluorescence spectroscopy has been employed to monitor document fibers as a function of their natural aging time (Martínez et al. 2017). FTIR combined with chemometrics has been used to predict document dating, with root mean squared errors (RMSE) of around 4 years; after an sPLS model was built, the informative parts of the average spectrum were concentrated on kaolinite and calcium carbonate bands (Silva et al. 2018).

In this study, we propose a preliminary method to estimate the dating of documents by FTIR. sPLS–LS-SVM was used to acquire stable variables for modeling over 100 repeated calculations, and the selected variables were investigated to reveal how they contributed to the dating of documents.

Materials and methods

ATR-FTIR spectroscopy

Both background and samples were measured by an attenuated total reflection Fourier transform infrared (ATR-FTIR) spectrometer in the range of 4000–650 cm−1 at a resolution of 2 cm−1; each spectrum was the average of 16 scans. The samples were pressed against the diamond crystal of the ATR device with a torque knob to ensure the same applied pressure for all measurements (Gomez-de Anda et al. 2012). The crystal was cleaned with anhydrous ethanol between successive measurements to avoid trace contamination, and the cleaned crystal was checked by running a background spectrum. Spectra were recorded on a Nicolet iS5 ATR-FTIR spectrometer (Thermo Scientific Co., Ltd., U.S.) and collected with OMNIC 9.7.43 software.

Samples

Given the RMSE of about 4 years reported by Silva et al. (2018), the book samples in this study were collected at 5-year intervals from 1940 to 1980. All samples were found and measured in the basement of the China Agricultural University Library. Because the paper of the documents was very thin, seven sheets were stacked to form one sample. Triplicate spectra were collected at three positions of each sample (top, middle and bottom), generating nine spectra per stack of seven sheets, and three parts of each book (front, middle and back) were sampled, giving 27 spectra per book. Five books were chosen for each 5-year interval from 1940 to 1980, so a total of 1215 spectra were obtained for analysis. The samples were divided into a training set of 810 spectra and a test set of 405 spectra by systematic sampling.
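The exact systematic-sampling scheme is not detailed above; the following minimal Python sketch assumes the simplest variant, in which every third spectrum in collection order is assigned to the test set, reproducing the reported 810/405 split. The arrays X and y are hypothetical stand-ins for the measured spectra and their book years.

```python
import numpy as np

# Hypothetical stand-ins: 1215 spectra (rows) and their book years.
X = np.random.rand(1215, 670)                  # absorbance values (assumed width)
y = np.repeat(np.arange(1940, 1985, 5), 135)   # 9 year-classes x 135 spectra each

# Systematic sampling: every third spectrum goes to the test set,
# yielding the reported 810 training and 405 test spectra.
idx = np.arange(len(X))
test_mask = (idx % 3 == 2)
X_train, y_train = X[~test_mask], y[~test_mask]
X_test, y_test = X[test_mask], y[test_mask]
print(X_train.shape, X_test.shape)             # (810, 670) (405, 670)
```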

Algorithms and modeling

Principal component analysis

Principal component analysis (PCA) is a multivariate analysis method that provides an unsupervised interpretation, without prior assumptions, for distinguishing different samples (Liu et al. 2018). PCA reduces the dimensionality of the original data space to a smaller and more efficient abstract space of latent variables, in which the data can be displayed while the information of the original space is essentially retained (Mees et al. 2018).
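As a rough illustration, such a PCA can be run in a few lines with scikit-learn; this sketch assumes the X_train matrix from the split sketch above and is not the authors' actual computation.

```python
from sklearn.decomposition import PCA

# Project the (internally mean-centered) spectra onto the first three PCs.
pca = PCA(n_components=3)
scores = pca.fit_transform(X_train)       # PC scores, used for 2D/3D score plots
print(pca.explained_variance_ratio_)      # fraction of variance per PC
loadings = pca.components_                # rows = loadings vs. wavenumber
```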

Sparse partial least squares classification analysis

sPLS, developed by Chung and Keles (2010), is based on PLS. PLS is a well-known dimension-reduction method, and sPLS adds a "sparsity" constraint to achieve variable selection and dimension reduction simultaneously. In sPLS, only variables with non-zero weights that contribute to explaining the variability of the response are selected. Meanwhile, sPLS reduces the noise contained in irrelevant variables and can thus be used for variable selection (Abdel-Rahman et al. 2014).
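The full sPLS algorithm of Chung and Keles (2010) is not reproduced here, but its core sparsity step can be sketched for a single component: the PLS direction vector X^T y is soft-thresholded so that weakly contributing variables receive exactly zero weight and drop out. The function below is an illustrative simplification, not the implementation used in this study.

```python
import numpy as np

def spls_first_component(X, y, eta=0.5):
    """One-component sketch of the sPLS sparsity idea: soft-threshold
    the PLS direction vector so weak variables get zero weight."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    z = Xc.T @ yc                        # covariance-driven PLS direction
    thresh = eta * np.abs(z).max()       # eta in [0, 1) tunes the sparsity
    w = np.sign(z) * np.maximum(np.abs(z) - thresh, 0.0)
    w /= np.linalg.norm(w)               # unit-norm sparse direction vector
    t = Xc @ w                           # latent score used for modeling
    selected = np.flatnonzero(w)         # indices of retained wavenumbers
    return w, t, selected
```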

Least squares support vector machines

LS-SVM, first proposed by Suykens et al. (2002), is based on the standard support vector machine (SVM) and is a widely applicable machine-learning technique for classification and regression (Yang et al. 2018). When using SVM or LS-SVM, three crucial problems must be solved: the determination of the optimal input feature subset, a proper kernel function, and the best kernel parameters. In this study, LS-SVM with a radial basis function (RBF) kernel was used, and the combination of the regularization parameter gam (γ) and the kernel parameter sig2 (σ²) yielding the smallest root mean square error of cross validation (RMSECV) was chosen (Camps-Valls 2011).
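For reference, binary LS-SVM training reduces to solving a single linear system in the dual variables, which is what makes it lighter than standard SVM; the sketch below assumes ±1 labels and an RBF kernel. A multi-class problem such as the nine year-classes here would additionally require a one-vs-one or one-vs-rest scheme, and gam and sig2 would be tuned by cross-validation as described.

```python
import numpy as np

def rbf_kernel(A, B, sig2):
    """RBF kernel K(a, b) = exp(-||a - b||^2 / sig2)."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / sig2)

def lssvm_train(X, y, gam=10.0, sig2=1.0):
    """Binary LS-SVM on +/-1 labels: training is one linear system,
    [[0, 1^T], [1, K + I/gam]] [b; alpha] = [0; y]."""
    n = len(y)
    K = rbf_kernel(X, X, sig2)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(n) / gam
    rhs = np.concatenate(([0.0], y.astype(float)))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]               # bias b, dual coefficients alpha

def lssvm_predict(X_new, X, alpha, b, sig2=1.0):
    """Class prediction: sign of the kernel expansion plus bias."""
    return np.sign(rbf_kernel(X_new, X, sig2) @ alpha + b)
```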

Model estimation

To characterize the prediction ability (efficiency) of the classification models, three parameters were used: the number of non-assigned samples, the error rate and the accuracy.

Non-assigned samples are samples that cannot be recognized or assigned to any class by the model.

Error rate is the ratio of wrongly assigned samples and is calculated as Eq. 1:

$$ \text{Error rate} = 1 - \frac{1}{G}\sum_{g=1}^{G} \frac{n_{gg}}{n_{g}} $$
(1)

where $G$ is the total number of classes, $n_g$ is the total number of samples belonging to the g-th class, and $n_{gg}$ is the number of samples belonging to class g that are correctly assigned to class g.

Accuracy is the ratio of correctly assigned samples and is calculated as Eq. 2:

$$ \text{Accuracy} = \frac{\sum_{g=1}^{G} n_{gg}}{n} $$
(2)

where $n$ is the total number of samples. Non-assigned samples are not considered in the accuracy calculation.
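Both figures of merit follow directly from a confusion matrix; a small sketch (with a hypothetical 3-class matrix) is given below, where non-assigned samples are assumed to have been excluded from the matrix beforehand, consistent with the definitions above.

```python
import numpy as np

def classification_metrics(conf):
    """conf[g, h] = number of class-g samples assigned to class h.
    Non-assigned samples must be excluded from conf beforehand."""
    n_g = conf.sum(axis=1)                  # samples per true class
    n_gg = np.diag(conf)                    # correctly assigned per class
    error_rate = 1.0 - (n_gg / n_g).mean()  # Eq. 1
    accuracy = n_gg.sum() / conf.sum()      # Eq. 2
    return error_rate, accuracy

# Hypothetical 3-class confusion matrix:
conf = np.array([[50, 2, 0],
                 [1, 47, 2],
                 [0, 0, 48]])
print(classification_metrics(conf))
```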

Results and discussion

Spectral features

Because the spectral resolution was set to 2 cm−1, each spectrum contained a large number of data points and processing was rather slow. To preserve the spectral information while compressing the data and improving the running speed, every five adjacent points were averaged into one. Since seven sheets of paper were stacked per sample, sample thickness differed considerably, so area normalization was applied to eliminate differences in optical path (Fig. 1).
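These two preprocessing steps can be sketched as follows; the paper does not specify the exact normalization formula, so division by the integrated (summed) absorbance is assumed here as the area normalization.

```python
import numpy as np

def preprocess(spectrum, bin_size=5):
    """Average every bin_size adjacent points to compress the data,
    then area-normalize to cancel optical-path (thickness) differences."""
    n = len(spectrum) - len(spectrum) % bin_size       # trim the remainder
    binned = spectrum[:n].reshape(-1, bin_size).mean(axis=1)
    return binned / binned.sum()                       # unit total area

# Apply to every spectrum (rows of X from the sampling sketch above).
X_processed = np.apply_along_axis(preprocess, 1, X)
```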

Fig. 1

Spectra with no preprocessing (a); spectra after area normalization (b). The abscissa of both panels is wavenumber; the ordinate of panel (a) is absorbance, and the ordinate of panel (b) is the area-normalized absorbance

As is known, the main component of paper documents is cellulose (Liu et al. 2018). In addition, inorganic compounds are essential to keep documents white and smooth. Among the inorganic compounds found in documents, calcium carbonate (CaCO3) and kaolinite (Al2Si2O5(OH)4) are the most common ingredients (Zięba-Palus et al. 2017; Silva et al. 2018).

Before statistical analysis, spectra of the document samples from 4000 to 650 cm−1 were collected, as shown in Fig. 2. Two significant differences among the spectra are marked by the black ellipses in Fig. 2; based on these differences, all spectra can be divided into two types. A representative sample of each type was chosen and its absorption peaks were assigned to chemical compounds (Fig. 3a, b).

Fig. 2

Spectra of some samples. The black ellipses show the main differences in the absorption peaks

Fig. 3

The document spectrum with kaolinite (a); the document spectrum with carbonate (b). The absorption peaks enclosed by green rectangles belong to cellulose, the blue rectangles mark peaks assigned to kaolinite, and the yellow rectangles represent carbonate

The features common to Fig. 3a and b are the cellulose bands marked by the green rectangles in Fig. 3a: HCH bending and wagging modes are observed in the 1200–1500 cm−1 range, the absorption near 1650 cm−1 is due to water, CH stretching of CH, CH2 and CH3 groups appears in the 2800–2950 cm−1 range, and the OH–O intramolecular hydrogen bond absorbs in the 3300–3500 cm−1 range; published work assigns these peaks to cellulose (Silva et al. 2018). The absorption peaks at 748, 788, 911, 1003, 1028, 3618 and 3690 cm−1 can be assigned to kaolinite (Silva et al. 2018) (blue rectangles in Fig. 3a). Carbonate contributes not only at 875 cm−1 but also at 1410–1420 cm−1; because the 1410–1420 cm−1 range overlaps with the 1420–1430 cm−1 band of cellulose (Martínez et al. 2017), no distinct carbonate peaks were observed in the 1410–1420 cm−1 range, whereas the peak around 875 cm−1 was evident (yellow rectangle in Fig. 3b).

Principal component analysis results

Principal component analysis (PCA) was applied to explore grouping of the books by year. Figure 4a shows the two-dimensional score plot: the abscissa is PC1, which carries 82.00% of the data information, and the ordinate is PC2, which carries 9.00%. Figure 4b gives the three-dimensional plot including PC3, which carries 5.00% of the information. The samples from different years overlapped with one another, and it is evident that the years cannot be distinguished from Fig. 4 alone.

Fig. 4

Two-dimensional plots of PC scores (a); three-dimensional plots of scores (b)

The chemical information contained in each principal component (PC) is shown in Fig. 5. PC1 mainly includes the absorption bands at 907, 995, 1027, 1446, 3617 and 3691 cm−1, which mostly carry information on cellulose and inorganic components. In PC2, the absorption bands at 906, 3617 and 3691 cm−1 give positive contributions and are related to carbonate and kaolinite, as mentioned above, whereas some absorption peaks of cellulose, such as 1053, 1317, 2894 and 3329 cm−1, give negative contributions.

Fig. 5

Loadings of the spectra with no preprocessing (PC1 carries 82% of the information, PC2 carries 9%)

Modeling

Only PCA couldn’t distinguish the year of the sample quite well, thus, some pattern recognitions were introduced to ascertain the accurate date of documents. Linear discriminant analysis (LDA), soft independent modeling of class analogy (SIMCA), and LS-SVM were used to establish models. The results were shown in Fig. 6. As shown in confusion matrix, the larger the number, the darker of the grid, wherein, only the correctly classified samples were on the diagonal while the misclassified samples fell in other areas. Figure 6a is the confusion matrix of LDA result, only nearly half of samples were correctly classified. The confusion matrix of SIMCA results (Fig. 6b), is better than LDA apparently. Thus, data were exhibiting non-linearity. When all variables from spectra were used to build LS-SVM model, only three samples were misclassified from confusion matrix of LS-SVM result (Fig. 6c). The classified accuracy of three algorithms was summarized in Fig. 6d. The accuracy of LDA was only 48.00%. Model built by SIMCA was with an accuracy of 62.00%. The best result was from LS-SVM, with the accuracy of 99.26%, the error rate of 0.74%.

Fig. 6

The confusion matrix of LDA result (a); the confusion matrix of SIMCA result (b); the confusion matrix of LS-SVM result (c); the classification results of three algorithms (d)

FTIR combined with LS-SVM is thus an effective way to identify the dating of documents. However, the specific spectral variables contributing to the dating were still unknown. Therefore, sPLS was used to extract the wavenumber variables that determine the dating of documents. Specifically, sPLS and LS-SVM were integrated (sPLS–LS-SVM) to build on the excellent result of LS-SVM (99.26%). The sPLS–LS-SVM algorithm operated as follows: sPLS first selected the informative variables, and LS-SVM was then modeled on the selected variables. After sPLS, a total of 473 variables were selected, as shown in Fig. 7a. The sPLS–LS-SVM model reached an accuracy of 99.01% and an error rate of 0.99% (Fig. 7b); only four samples were misclassified and six samples were non-assigned. The variables selected by sPLS thus classify the dating of documents extremely well.

Fig. 7

The selected variables of sPLS (a); the confusion matrix of sPLS–LS-SVM (b)

As is known, a single pair of training and test sets cannot guarantee a convincing result; moreover, the selected variables differ after every calculation. To acquire stable variables, we renewed the training and test sets 100 times by Monte Carlo random sampling, executing the complete sPLS–LS-SVM procedure each time. After 100 runs, the accuracies and the groups of selected variables were obtained. The cyclic process of sPLS–LS-SVM is shown in Fig. 8.
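A sketch of this resampling loop follows, combining the earlier sPLS sketch for variable selection with a hypothetical train_and_score helper standing in for the LS-SVM training and test-set scoring; the selection frequency of each wavenumber is accumulated across runs.

```python
import numpy as np

rng = np.random.default_rng(0)
n_runs = 100
accuracies = []
freq = np.zeros(X_processed.shape[1])      # selection count per variable

for _ in range(n_runs):
    # Monte Carlo resampling: a fresh random 810/405 split each run.
    perm = rng.permutation(len(X_processed))
    tr, te = perm[:810], perm[810:]
    # sPLS step (sketch from above): select variables on the training split.
    _, _, selected = spls_first_component(X_processed[tr],
                                          y[tr].astype(float))
    freq[selected] += 1
    # LS-SVM step: train_and_score is a hypothetical helper that fits
    # LS-SVM on the selected variables and returns test-set accuracy.
    acc = train_and_score(X_processed[tr][:, selected], y[tr],
                          X_processed[te][:, selected], y[te])
    accuracies.append(acc)

print(np.mean(accuracies), np.std(accuracies))  # reported: 99.34% +/- 0.43%
```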

Fig. 8

The cycle process of sPLS–LS-SVM

After 100 calculation runs, 100 accuracy values were obtained. Among them, the minimum accuracy was 97.96%, the maximum was 100%, the average was 99.34% and the standard deviation was 0.43%. The model can be considered stable because both the accuracy and its variance were acceptable.

The average number of variables over the 100 sets was around 483, and a more important variable corresponds to a higher selection frequency. The results are shown in Fig. 9, with wavenumber as abscissa and frequency as ordinate. The variables at 770–900, 940–1100, 1110–1820, 1800–2400, 2800–3100, and 3600–3900 cm−1 showed high frequencies and thus contributed more information to determining the date of documents. The variables at 770–900, 940–1100 and 3600–3900 cm−1 correspond to absorption peaks of kaolinite, and those at 1100–1820 and 2800–3100 cm−1 to the cellulose HCH absorption peaks (see Fig. 3). Both cellulose and inorganic components therefore affect the dating of documents. However, since the inorganic components are stable and hardly change over time, their selection may reflect small differences between paper substrates. The selected variables concentrated at 1500–2300 cm−1 correspond to absorption peaks of H2O and CO2; because interference from atmospheric H2O and CO2 cannot be avoided, their influence was not investigated here.

Fig. 9

The result of variable selection over 100 runs

Conclusion

Compared with the study of Silva et al. (2018), a different evaluation method was used: Silva et al. used RMSECV to evaluate their model, whereas accuracy was used in this study. Both methods show that FTIR combined with chemometrics can accurately distinguish the age of paper. In this study, journal documents from 1940 to 1980 were chosen for analysis; compared with other document types, a journal's publication date is closer to the age of the paper itself, and every journal averaged two hundred sheets. Furthermore, the training and test sets were renewed 100 times by Monte Carlo random sampling, so the final selected variables are more statistically significant and more stable.

From the results, it was confirmed that FTIR spectroscopy combined with sPLS–LS-SVM can be an effective method to identify the dating of documents, with satisfactory classification accuracy. For comparison, the accuracies of LDA, SIMCA and LS-SVM were 48.00%, 62.00% and 99.26%, respectively. About 483 variables on average were selected by sPLS over 100 runs, concentrated on the absorption peaks of inorganic components and cellulose, and the average sPLS–LS-SVM accuracy over the 100 runs was 99.34%.

The accuracy (99.34%) is excellent when sampling at 5-year intervals; however, a more precise model should be developed in further study with samples from every year. In practice, the dataset should also be expanded with documents covering more years.